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Abstract 


Parallelism  in  scientific  applications  can  most  often  be  found  at  the  loop  level. 
Although  Ada  supports  parallelism  via  the  task  construct,  its  coarseness  renders 
it  unsuitable  for  this  light-weight  parallelism.  In  this  work,  we  propose  Ada 
constructs  to  achieve  efficient  loop-level  paralleUsm  in  ANSI-Ada.  This  is  ac- 
complished in  two  steps.  First,  we  present  an  idiom  that  allows  the  specification 
of  light-weight  tasks.  Second,  we  give  an  efficient  implementation  of  this  idiom 
that  is  considerably  more  efficient  than  a  standard  Ada  task. 

In  addition,  we  present  an  idiom  that  makes  the  fetch_and_add  synchroniza- 
tion primitive  available  at  the  Ada  level.  Our  implementation  of  this  idiom  is 
more  efficient  in  both  time  and  space  than  previous  results.  In  addition  to  pro- 
viding universal  synchronization,  using  fetch.and.add  simplifies  program  andysis 
(e.g.  proving  the  absence  of  race  conditions  in  the  implementation  of  a  parcdlel 
algorithm).  Since  aJl  these  idioms  are  written  in  standard  Ada,  they  maintain 
the  portabihty  that  is  central  to  the  mandated  uses  of  the  language. 
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Chapter  1 


Introduction 


More  and  more  computationcdly  intensive  applications  are  being  executed  on 
parallel  machines  [AG89].  These  machines  achieve  their  increased  processing 
power  by  exploiting  the  parallelism  present  in  an  application  and  distributing  it 
among  multiple  processing  elements.  Since  most  of  these  applications  have  been 
written  for  sequential  machines,  their  implementation  on  parallel  architectures 
requires  new  approaches  to  algorithmic  development. 

One  approach,  the  field  of  Parallelizing  Compilers  [Hin88,ACK87,ABC+87] 
takes  programs  written  for  sequential  architectures  and  automatically  converts 
them  to  semantically  equivalent  parallel  programs.  TypiceJly,  these  compilers 
take  as  input  a  sequential  FORTRAN  program  (FORTRAN  still  being  the  prevalent 
language  for  scientific  computation)  and  produce  a  FORTRAN  program  with  ap- 
propriate parallel  constructs  added.  Although  great  success  heis  been  achieved 
in  the  related  field  of  Vectorizing  Compilers  [AK84,KKLW84,Col87],  compilers 
that  generate  code  for  vector  machines,  it  remains  to  be  seen  whether  paralleliz- 
ing compilers  can  detect  and  exploit  a  sufficient  amount  of  parallehsm  at  the 
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doall  i  =  1,  N 

endall 
Figure  1.1:  A  ParFOR  Code  Fragment 

level  of  granularity  required  by  multiprocessors  [Ken88]. 

An  alternative  approach  is  to  allow  the  programmer  the  full  power  to  express 
the  parallelism  of  an  algorithm  directly,  by  writing  it  in  a  parallel  programming 
language  [Bab88,Qui87].  This  approach  not  only  lets  the  programmer  decide  the 
appropriate  granularity  of  parallelism,  but  also  how  and  where  this  parallelism 
should  be  employed  in  the  program.  The  manner  in  which  the  programmer  is 
allowed  to  specify  the  parallehsm  in  an  algorithm  is  a  very  important  decision. 

An  increasing  amount  of  programs  are  being  written  for  parallel  machines 
using  languages  that  are  essentially  sequential  languages  onto  which  parallel  con- 
structs have  been  grafted  [IBM88,Ber88,Inc81,Ost89].  Since  each  parallel  ma- 
chine often  provides  its  own  idiosyncratic  parallel  constructs,  portability  across 
machines  is  troublesome.  Although  an  effort  hcis  been  undertaken  to  agree  on 
common  parallel  constructs  for  multiprocessors  [PCF88],  it  remains  to  be  seen 
whether  this  work  will  develop  into  an  industry-wide  standard. 

Figure  1.1  presents  a  code  fragment  written  in  a  version  of  FORTRAN  that 
has  been  modified  to  support  parallelism  [Ber88].  The  semantics  of  the  doall 
construct  specify  that  each  iteration  of  the  loop  may  be  executed  in  parallel.  This 
medium  grain  or  loop-level  parallelism  is  the  most  common  form  of  parallelism 
used  in  scientific  applications.  Each  iteration  of  the  loop  executes  the  same  set 
of  instructions  with  only  the  loop  index  var)ring.  Since  each  iterate  can  be  given 
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to  a  parallel  thread  before  it  starts  to  execute,  no  synchronization  is  required 
between  starting  and  terminating  each  parallel  thread.  This  characteristic  allows 
this  type  of  parallelism  to  be  implemented  efficiently  on  a  variety  of  machines. 

Ada  is  a  modern  programming  language  which  supports  paralleHsm  as  well 
as  programming  in  the  large,  data  abstraction,  strong  typing,  and  exception 
handling.  An  ANSI  standard  [Ada83]  is  intended  to  ensure  portability  across 
implementations.  Many  compilers  exist  for  Ada  on  sequential  machines  and 
efforts  have  been  undertaken  to  make  Ada  available  on  parallel  architectures  as 
well  [Ric89,AL89,CCB89].  For  these  reasons,  Ada  appears  to  be  an  excellent 
choice  for  algorithms  geared  toward  parallel  architectures. 

The  parallel  construct  of  Ada  is  the  task.  It  is  a  powerful  generalization 
of  coroutines,  that  supports  coarse-grain  parallelism,  as  well  as  synchroniza- 
tion and  communication  (both  conditionjJ  and  unconditiontd)  among  otherwise 
asynchronous  processes.  Due  to  its  generality,  a  significant  amount  of  over- 
head is  required  to  activate  and  terminate  a  task  [BN87,Ard87,Hil82b].  This 
overhead  makes  the  Ada  task  seemingly  inappropriate  for  loop-level  parallelism 
[Yem82,Bur85,Jha90]. 

However,  the  attractiveness  of  programming  in  Ada  with  its  desirable  fea- 
tures (exception  handling,  programming  in  the  large,  strong  typing,  etc.)  sug- 
gests that  if  the  overhead  associated  with  task  initiation  and  termination  were 
eliminated,  Ada  would  be  a  language  of  choice  for  the  largest  scientific  com- 
putations -  an  area  of  apphcation  where  FORTRAN  continues  to  reign.  Frankel 
[Fra87]  postulates, 
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"Ada  tasking  can  be  made  as  efficient  as  any  other  method  of  con- 
currency programming,  but  the  generality  of  the  Ada  model  requires 
special  cases  to  be  extracted  and  used." 

This  work  will  focus  on  making  loop-level  Ada  tasking  efficient  by  using  pro- 
gram transformations  similar  to  those  described  in  [SS85],  [Hil82a]  and  [Hil82b]. 
Using  these  methods,  we  are  able  to  achieve  an  efficient  implementation  without 
sacrificing  portability. 

Chapter  2  provides  a  background  for  this  work;  it  discusses  our  multipro- 
cessor model  and  the  fetch_and-add  primitive.  It  also  gives  an  overview  of  the 
Ada  tasking  model  and  introduces  three  shortcomings  in  Ada:  the  lack  of  the 
ability  to  distribute  identities  to  tasks  in  parallel,  the  synchronization  point  that 
is  required  during  task  initiation,  and  the  high  overhead  required  to  manage  a 
task's  status. 

Chapter  3  not  only  solves  the  first  of  these  two  problems,  but  also  makes  the 
fetch.and.add  primitive  available  at  the  Ada  level.  By  doing  this  we  provide  the 
user  with  a  efficient  synchronization  mechanism  that  can  be  used  to  prove  an 
algorithm  is  free  of  race  conditions. 

Chapter  4  provides  a  solution  for  obtaining  efficient  loop-level  parallelism. 
This  solutions  is  illustrated  by  the  examples  in  Chapter  A.  Chapter  7  concludes 
this  work  and  suggest  some  future  research  directions. 

The  expected  result  of  this  work  is  to  widen  substantially  the  use  of  Ada  in 
large  scale  scientific  programming  on  current  and  future  multiprocessor  circhi- 
tectures.  One  niay  even  hope  for  a  progressive  replacement  of  FORTRAN  in  this 
domain  of  apphcations,  even  though  past  history  gives  us  the  slimmest  of  hopes 
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in  this  regard.  Geoffrey  Hunter  [Hun90]  postulates  that, 

"The  only  technically  rational  way  of  advancing  the  art  of  scientific 
and  engineering  programming  is  to  abandon  FORTRAN  in  favor  of  a 
modern,  block-structured  language  such  as  Algol-68  or  Ada." 

We  beheve  this  work  will  help  towards  this  transition. 


Chapter  2 


Background 


This  chapter  is  divided  into  three  sections,  providing  the  background  for  this 
work.  The  first  section  describes  the  machine  model.  The  second  section  dis- 
cusses the  Ada  tasking  model,  highhghting  its  advantages  as  well  as  its  short- 
comings. We  conclude  this  chapter  by  discussing  the  NYU  Ada/Ed  system  and 
related  work. 

2.1      The  Machine  Model 

As  our  work  deals  with  practical  parallelism,  our  underlying  machine  contciins 
multiple  processors.  A  fundamental  decision  in  designing  a  miiltiprocessor  is  the 
placement  of  memory.  In  a  shared  memory  machine  a  common  memory  can  be 
accessed  by  each  processor.  In  a  distributed  memory  machine  each  processor  has 
its  only  local  memory;  no  shzo-ed  memory  exists.  Since  both  architectures  pro- 
vide the  same  computational  power,  the  main  issue  is  efficiency.   A  distributed 
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architecture  allows  faster  access  to  local  memory,  while  rendering  non-local  ac- 
cesses more  costly.  A  shared  memory  machine  provides  uniform  access  to  all  of 
memory,  but  this  access  cannot  be  performed  in  constant  time. 

We  follow  the  work  of  Hummel  [Hum88]  and  assume  an  asynchronous,  shared 
memory  multiprocessor  as  our  model.  The  IBM  RP3  [PBG  +  SS]  and  its  prede- 
cessor, the  NYU  Ultracomputer  [GGK+SS],  are  examples  of  machines  that  fall 
under  this  model. 

2.1.1      The  Fetch^nd_Add  Primitive 

Our  work  utihzes  the  fetch.and.add  primitive.  The  fetch.and.add  primitive  is 
a  universal  coorchnation  primitive  that  provides  an  efficient  means  of  synchro- 
nization among  many  asynchronous  processors  while  avoiding  "hot  spots".  The 
fetch.and.add  primitive  can  be  used  as  a  solution  to  the  mutual  exclusion  prob- 
lem, the  readers-writers  problems,  and  other  problems  dealing  with  parallel  ac- 
tivities [GLRSSj.  In  addition,  it  can  simplify  program  analysis;  it  presents  a 
form  of  synchronization  that  can  be  used  to  prove  an  algorithm  is  free  of  race 
conditions. 

The  semantics  of  the  fetch.and.add  primitive  are  as  follows:  if  a  process 
executes  fetch. and.add(shared_var,  inc),  the  value  of  shared.var  is  incremented 
by  inc,  and  the  original  value  is  returned.  These  two  operations  are  performed 
atomically.  If  two  processors  execute  fetch.and.add  operations  on  shared.var, 
the  result  is  that  shared  var  is  incremented  by  the  sum  of  the  two  increments 
and  one  processor  is  returned  the  original  value  of  shared.var,  while  the  other 
receives  the  sum  of  the  original  value  and  the  first  processor's  increment. 
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For  example,  suppose  F  is  a  shared  variable  and  processor  i  executes 

Xi  :=  fetch. and.add(F,e.), 

while  processor  _;  executes 

Xj  :=fetch.and_add(V,e,). 

If  V  is  not  updated  by  any  other  processor,  then  the  resvdt  of  these  two  opera- 
tions is  either 

X,  :-  V 

X,  :=V  +  e. 

or 

Xi:=V  +  ej 
Xj  :=  V 

As  a  further  illustration  of  the  usefulness  of  the  fetch_and_add  primitive  we 
consider  the  following  problem:  suppose  we  wish  to  distribute  N  entries  of  an 
array  to  N  distinct  threads  in  parallel.  By  using  the  fetch_and_add  primitive, 
we  can  solve  this  problem  in  a  straightforward  manner.  First,  initialize  a  shared 
variable  to  the  value  of  the  first  index,  rii,  of  the  array.  Next  have  each  of  the  N 
processors  execute  a  fetch_and.add  operation  on  this  variable  with  a  unit  incre- 
ment. By  the  semantics  of  the  fetch_and_add  primitive,  each  processor  receives 
a  unique  index  value  ranging  from  nj  to  rii  -\-  N  —  1. 

We  describe  the  implementation  of  the  fetch_and_add  primitive  in  the  next 
section. 
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2.1.2      Memory  Hierarchy 

An  important  consideration  in  constructing  a  shared  memory  machine  is  decid- 
ing how  processors  are  connected  to  memory.  Although  we  assume  a  configu- 
ration in  the  spirit  of  the  Ultracomputer  and  RP3,  our  work  can  be  tailored  to 
alternative  configurations.  In  what  follows  we  discuss  the  memory  hierarchies 
found  in  the  Ultracomputer  and  RP3. 

To  facilitate  massive  parallelism  in  a  uniform  way,  processors  are  connected 
to  shared  memory  via  a  connection  network.  Both  the  RP3  and  Ultracomputer 
utilize  an  Omeja-network  for  this  purpose.  An  omega  network  provides  pro- 
cessors with  uniform  access  to  shared  memory.  In  addition,  an  omega  network 
provides  a  very  efficient  processor/memory  routing  scheme  [GLR83]. 

In  a  traditional  omega  network  (as  used  in  the  Ultracomputer)  processors 
reside  on  one  side  of  the  network  with  the  memory  units  on  the  opposite  side 
[Law75].  Since  a  path  of  log  A'^  switches  connects  each  processor  to  shared  mem- 
ory, all  shared  memory  requests  require  log  iV  steps.  In  an  attempt  to  reduce 
the  amount  of  network  accesses,  the  designers  of  the  RP3  cilso  connect  each  pro- 
cessor directly  to  a  memory  unit.  If  a  processor  references  shared  memory  of  the 
memory  unit  that  it  is  connected  to,  the  access  is  performed  locally,  avoiding 
a  network  traversal.  If  a  processor  references  shared  memory  that  is  not  in  its 
companion  memory  unit,  then  the  network  is  used  to  satisfy  this  request. 

In  addition  to  this  local  memory,  a  cache  is  often  provided  to  further  reduce 
network  accesses.  In  the  RP3  this  cache  can  be  memaged  by  software,  allowing 
for  both  greater  flexibility  and  responsibihty  in  coordinating  data  placement. 
For  more  details  sec  [BMW85]. 
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While  the  fetch_and_add  primitive  can  be  implemented  in  software  using  locks 
and/or  critical  sections,  machines  such  as  RP3  and  Ultracomputer  implement 
this  primitive  directly  in  hardware.  This  is  accomplished  by  adding  extra  logic 
to  the  switches  in  the  omega  network.  This  logic  has  the  abihty  to  combine  mul- 
tiple fetch_and_add  operations  on  the  same  variable,  substituting  one  operation 
with  an  increment  specified  by  the  sum  of  the  increments  of  the  requests.  By 
performing  this  combining  in  the  network,  multiple  fetch_and_adds  are  satisfied 
in  the  time  it  takes  for  one  shared  memory  access. 

Consider  the  scenario  where  all  TV  processors  simultaneously  issues  a  fetch- 
_and.add  operation  on  the  same  variable.  Without  the  presence  of  a  combining 
network,  hot  spots  will  arise  at  certain  nodes  of  the  network.  By  combining  these 
requests,  all  N  requests  can  be  satisfied  in  the  time  it  takes  for  one  fetch_and.add 
operation. 

2.2      Ada  Tasking 

In  this  section  we  present  a  brief  overview  of  the  Ada  tasking  model.  In  addition 
to  highlighting  some  of  its  features,  we  illustrate  three  deficiencies  of  this  model 
that  arise  when  one  attempts  to  express  loop-level  paxallehsm.  Other  features 
of  the  language  are  illustrated  in  subsequent  chapters. 

Parallelism  in  Ada  is  expressed  by  using  the  task  construct  (ARM  9(1-2)).^ 
A  task  is  an  entity  that  executes  in  parallel  with  other  tasks.     To  facilitate 


^The  semantics  of  the  language  are  defined  by  the  Ada  Reference  Manual  [Ada83].  As  many 
references  are  made  to  this  document  throughout  this  work,  we  employ  a  shorthand  in  referring 
to  a  particular  section.  We  use  "ARM  c.s(i)"  to  correspond  to  item  i  of  section  s  located  in 
chapter  c  of  the  Ada  Reference  Manual. 
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communication  between  two  otherwise  asynchronous  tasks,  Ada  allows  a  task  to 
declare  one  or  several  entry  points.  An  entry  may  be  ccJled  by  another  task  to 
participate  in  a  rendezvous  with  the  first  task.  The  task  that  declares  this  entry 
can  specify  when  in  its  execution  it  wishes  the  rendezvous  to  occur  by  executing 
an  accept  statement  naming  the  appropriate  entry.  Likewise,  the  calhng  task 
can  specify  the  point  at  which  it  wishes  to  call  an  entry  by  executing  an  entry 
call.  An  entry  may  have  parameters  dissociated  with  it  in  the  same  manner  as 
subprograms,  making  the  rendezvous  mechanism  a  means  of  communication  as 
well  as  synchronization. 

When  two  tasks  perform  an  entry  call  and  accept  statement  for  the  same  en- 
try, a  rendezvous  occurs  (ARM  9.5(14));  the  calhng  task  suspends  its  execution 
while  the  body  of  the  accept  statement  is  executed.^  Upon  completion  of  this 
rendezvous,  both  tasks  continue  their  independent  execution  at  the  point  imme- 
diately following  the  body  of  the  accept  statement  and  entry  call,  respectively. 

Figure  2.1  shows  an  example  of  how  two  tasks  can  synchronize  access  to  a 
shared  variable  by  using  the  rendezvous  mechanism.  In  this  example,  tasks  tl 
and  t2  coordinate  their  accessing  of  the  global  variable,  shared,  by  synchronizing 
via  the  sync  entry  located  in  task  tl.  The  programmer  has  deemed  that  up  until 
t2  ccdls  this  entry,  it  is  allowed  to  write  to  shared.  After  this  has  occurred,  t2 
signals  to  tl  that  it  may  now  access  this  variable  exclusively  by  calling  the  sync 
entry. 


'We  assume  that  no  other  task  has  executed  an  entry  call,  prior  to  this  entry  call,  for  the 
same  entry. 
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shared  :  integer; 

task  tl  is 

entry  sync; 
end  tl; 

task  body  tl  is 

X  :  integer; 
begin 

accept  sync  do 
null; 

end  sync; 

X  :=  shared;  —  shared  is  declared  globally 
end  tl; 

task  body  t2  is 
begin 

shared  :=  5; 

tl.sync; signal  to  tl  it  can  now  use  shared. 

end  t2; 

Figure  2.1:  A  Rendezvous  used  for  Synchronization 
The  Select  Statement 

A  select  statement  can  be  used  to  non-deterministically  accept  one  of  several 
potential  entry  calls  (ARM  9.7.1).  Consider  Figure  2.2  which  shows  an  example 
of  a  select  statement.  In  this  example  the  student  task  wishes  to  accept  either 
one  of  two  entry  calls.  This  can  be  accomplished  by  using  a  select  statement. 
Upon  reaching  this  statement,  a  check  is  made  to  determine  if  any  tasks  have 
called  either  the  study  or  have.fun  entries.  If  only  one  of  these  entries  has  a  caller 
waiting,  that  entry  is  accepted  and  a  rendezvous  occurs.  If  neither  entry  has  a 
caller  waiting,  execution  blocks  until  a  call  on  one  of  these  entries  occurs.  If  at 
least  one  ccJler  exists  for  each  of  these  entries,  one  is  chosen  non-deterministically 
and  the  rendezvous  is  performed.  At  the  conclusion  of  the  rendezvous  the  select 
statement  is  complete. 
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task  student  is 
begin 
select 

accept  study  do 

end  study; 
or 

accept  have.fun  do 

end  have.fun; 
end  select; 

end  student; 

Figure  2.2:  An  Example  of  a  Selective  Waiit  Statement 

In  addition  to  this  simple  form,  three  other  forms  of  select  statements  are 
avciilable.  In  Figure  2.3,  we  consider  a  select  statement  with  a  terminate  aJter- 
native.  This  figure  illustrates  an  example  of  a  monitor  of  a  protected  variable, 
local.item.  All  accesses  to  this  variable  are  made  by  calling  the  read  or  write 
entries  of  the  task  buffer. 

Consider  the  body  of  the  buffer  task.  After  accepting  a  write  entry  call 
to  initialize  the  local  variable,  an  apparently  infinite  loop  containing  a  select 
statement  with  a  terminate  alternative  is  entered.  The  semantics  of  this  select 
are  as  follows:  each  of  the  entries  corresponding  to  the  accept  statements  is 
evaluated  to  determine  if  a  task  has  caJled  that  entry.  If  one  or  more  of  these 
entries  has  a  task  waiting,  a  rendezvous  occurs  with  one  of  these  waiting  tasks 
as  demonstrated  in  the  simple  version  of  the  select  statement. 

At  the  conclusion  of  the  rendezvous,  the  select  statement  has  completed  its 
execution.  The  next  iteration  of  the  loop  is  executed  and  the  process  contin- 
ues.   Without  the  presence  of  the  terminate  alternative,  this  task  would  never 
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task  buffer  is 

entry  \A/rite(ln_item:  in  item); 

entry  read(out.item:  out  item); 
end  buffer; 

task  body  buffer  is 
localJtem:  item; 
begin 

accept  write(in.item:  in  item)  do 

localJtem  :=  inJtem; 
end  write; 
loop 

select 

accept  read(outjtem:  out  item)  do 

out.item  :=  localJtem; 
end  read; 
or 

accept  write(inJtem:  in  item)  do 

localJtem  :=  in_item; 
end  write; 
or 

terminate; 
end  select; 
end  loop; 
end  buffer; 

Figure  2.3:  An  Example  of  a  Selective  Wait  Statement 

terminate,  endlessly  looping,  waiting  for  entry  calls.  However,  the  terminate  al- 
ternative specifies  that  upon  reaching  the  select  statement,  if  all  potentiaJ  callers 
have  either  terminated  or  reached  a  select  statement  with  a  terminate  alterna- 
tive, then  this  task  terminates.  The  apparently  infinite  loop  of  Figure  2.3  is 
indeed  finite.  If  upon  reaching  the  select  statement  no  entry  calls  are  pending, 
but  potential  callers  still  exist,  this  task  does  not  terminate;  it  waits  for  either 
an  entry  call  or  for  all  potential  callers  to  terminate  or  reach  a  select  with  a 
terminate  alternative. 

The  select  statement  can  also  be  used  for  entry  calls.   We  avoid  a  detailed 
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description  here,  but  instead  focus  on  other  issues  concerning  the  Ada  tasking 
model.  See  ARM  9.7.2  and  ARM  9.7.3  for  more  details. 

2.2.1  The  Suitability  of  Ada  for  Shared  Memory  Multi- 
processors 

Although  it  may  appear  that  Ada's  tasking  model  is  designed  with  a  distributed 
memory  architecture  in  mind,  it  is  nevertheless  well-suited  for  shared  memory 
multiprocessors.  The  very  fact  that  Ada's  block  structure  imphes  that  more  than 
one  task  can  access  the  same  variable  implies  that  a  shared  memory  architecture 
is  required. 

Several  implementations  of  Ada  exist  on  shared  memory  architectures  [Ric89] 
[DFSS89,AL89,CCB89]  [Alhant, Sequent].  In  addition,  it  has  also  been  shown 
that  parallel  numerical  algorithms  normally  written  in  FORTRAN  for  these  ma- 
chines can  also  be  written  in  Ada  [Blu81,SS85].  The  intended  goal  of  our  work 
is  to  add  to  this  effort  by  illustrating  how  loop-level  parallelism  can  be  both 
expressed,  and  efficiently  implemented,  in  Ada. 

2.2.2  Deficiencies  of  Ada  Tasking  Model 

Three  significant  deficiencies  in  the  Ada  tcisking  model  are  majiifested  when  one 
attempts  to  express  loop-level  parallelism: 

1.  Although  no  parallel  loop  construct  exists  in  Ada,  one  can  use  tzisks  to 
specify  a  semantically  equivalent  entity  as  shown  by  Figure  2.4.  Con- 
structing a  parallel  loop,  like  many  other  problems  in  parallel  processing, 
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declare 

task  type  iterate.task  is 

entry  get_id(id  :  in  integer);  —  iterate.task  contains  one  entry, 
end  iterate.task; 

parJoop  ;  array  (IN)  of  iterate.task;  —  Create  N  iterate.tasks. 

task  body  iterate.task  is 

myJd  :  integer; 
begin 

accept  get.id(id  :  in  integer)  do 

myJd  :=  id;  —  Acquire  an  id  from  parent. 

end  get.id; 

5i 

5„ 
end  iterate.task; 
begin  —  All  iterate.tasks  begin  to  execute. 

for  i  in  1..N  loop  —  Identities  are  distributed  sequentijdly. 

parJoop(i).getjd(i); 
end  loop; 
end;  —  Wait  for  all  iterate.tasks  to  terminate. 

Figure  2.4:  An  Ada  Paradlel  Loop  Equivalent 
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requires  that  each  task  possess  a  unique  identity.  In  the  case  of  a  parallel 
loop,  this  identity  is  the  iteration  value  for  the  corresponding  iterate.  The 
only  mechanism  by  which  each  task  can  initialize  its  state  uniquely  (with 
its  iterate)  is  to  provide  each  with  an  appropriate  entry,  and  to  have  an- 
other task  perform  a  rendezvous  with  each  iterate  task.^  This  solution  is 
undesirable  not  only  because  of  the  rendezvous  that  must  be  performed, 
but  m.ore  importantly  because  of  the  bottleneck  that  results.  Additioned 
descriptions  of  this  shortcoming  can  be  found  in  [Yem82,Cla87,Blu8l]. 

2.  Task  initiation  is  a  two  stage  process  [Ada83].  Task  objects  are  first  cre- 
ated, and  then,  provided  no  exceptions  are  raised,  they  are  activated. 
While  this  two  stage  process  is  beneficial  for  handling  exceptions,  the  syn- 
chronization point  it  mandates  increases  overhead,  and  thereby  limits  the 
usefulness  of  Ada  in  specifying  loop-level  peirallelism. 

3.  In  order  to  provide  a  sophisticated  method  of  task  communication  and 
synchronization,  an  Ada  implementation  must  maintain  additional  con- 
text (information  associated  with  rendezvous,  entry  queues,  ta^k  priori- 
ties, etc.).  As  the  full  generahty  of  the  Ada  task  is  not  needed  to  specify 
loop-level  parallelism,  unnecessary  context  is  present.  However,  if  a  com- 
piler can  ascertziin  that  a  task  will  execute  in  a  Ught-weight  fashion  (i.e. 
with  no  synchronization  with  other  tasks),  then  it  can  implement  this  task 
in  a  more  efficient  manner  [Fra87,Hil82b,Blu81]. 


'Although  this  initialization  is  performed  seriaJly  in  Figure  2.4,  a  more  efficient  tree-based 
solution  is  possible.  However,  as  we  shall  see  in  Chapter  6  a  bottleneck  is  aJso  present  in  this 
solution. 
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These  deficiencies  and  their  solutions  are  further  described  in  Chapters  3  and 
4. 

2.2.3      Shared  Variables  in  Ada 

In  Ada  a  shared  variable  corresponds  to  a  variable  that  is  referenced  by  more 
than  one  task.  There  are  two  types  of  shared  variables: 

Synchronous:  A  synchronous  shared  variable  is  a  variable  referenced  by  more 
than  one  task,  but  for  vphich  local  copies  can  be  maintained  between 
language- defined  synchronization  points;  if  one  task  writes  to  this  vari- 
able, no  other  task  can  read  or  write  to  it  until  a  synchronization  point  is 
reached. 

Asynchronous:  An  asynchronous  shared  variable  is  a  variable  referenced  by 
more  than  one  task  which  does  not  satisfy  the  exclusive  access  constraint  of 
synchronous  shared  variables  between  synchronization  points.  In  addition, 
asynchronous  shared  variables  are  restricted  to  scalar  and  access  types. 

The  motivation  behind  these  definitions  is  that  a  synchronous  shared  variable 
may  be  cached  locally  to  a  processor,  and  flushed  to  shared  memory  at  syn- 
chronization points  of  the  taisk  executing  on  the  processor.  On  the  other  hand, 
asynchronous  shared  variables  Ccinnot  be  cached  locally;  they  must  reside  in 
shared  memory. 

A  major  distinction  between  these  two  types  of  shared  variables  is  that  the 
asynchronous  type  can  be  tagged  as  such  by  the  programmer  using  the  pragma 
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SHARED,  while  the  synchronous  variety  c2Lnnot.''  Thus,  all  variables  not  marked 
as  pragma  SHARED,  but  nevertheless  used  by  more  than  one  task,  are  syn- 
chronous shared  variables. 
Dewar  [Dew^90]  notes: 

It  is  a  source  of  continuing  confusion  that  pragma  SHARED  does  not 
correspond  to  the  concept  of  shared  variable,  as  defined  in  Steelman^ 
or  the  RM,  but  rather  to  a  particular  subset  of  shared  variables, 
namely  the  asynchronous  ones.  This  means  for  example  that  the 
Steelman  requirement  that  shared  variables  be  marked  is  not  satis- 
fied by  the  provision  of  pragma  SHARED,  since  synchronous  shared 
variables  do  not  have  to  be  marked  in  Ada. 

In  addition  to  this  deficiency,  not  allowing  the  programmer  to  designate 
composite  data  structures  as  pragma  SHARED  variables  presents  a  problem.  Most 
applications  that  employ  loop-level  parallelism  typically  have  several  threads 
working  independently  on  parts  of  a  shared  data  structure.  Since  different  tasks 
access  different  parts  of  this  composite  object  simultaneously,  it  is  used  in  an 
asynchronous  manner.  However,  Ada  does  not  permit  composite  objects  to  be 
declared  as  pragma  SHARED  variables. 

As  suggested  by  Hummel  [Hum88],  an  additional  pragma  can  be  used  to 
overcome  this  deficiency.  For  example  the  DEC  Ada  compiler  supports  a  pragma 
VOLATILE  to  designate  composite  objects  that  are  used  in  a  shared  manner.  This 
problem  and  many  others  are  being  addressed  in  the  revision  of  the  language, 


^Asynchronous  shared  v&riablcs  are  sometimes  referred  to  as  pragma  SHARED  variables,  while 
the  synchronous  variety  are  called  Ada  shared  variables. 
'Steelman  is  the  requirements  document  for  Ada-83. 
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called  Ada  9X.  Dewar  [Dew90]  discusses  some  of  the  issues  involved  with  shared 
variables  in  relation  to  possible  Ada  9X  requirements.  In  Chapter  7  we  discuss 
the  current  state  of  the  revision  process,  focusing  on  the  proposed  revisions  that 
directly  relate  to  our  work. 

In  addition  to  problems  discussed  by  Dewar,  Shulman  [Shu87]  and  Hum- 
mel [Hum88]  discuss  other  issues  pertaining  to  shared  variables.  In  particular, 
they  specify  an  algorithm  to  detect  unmarked  synchronous  shared  variables. 
Although  in  most  cases  these  variables  can  be  found  at  compilation  time,  the 
use  of  Ada's  separate  compilation  facihty  can  hamper  this  detection,  forcing  its 
postponement  until  bind  time. 

Synchronous  shared  variables  are  relevant  to  our  work  because  the  cache 
semantics  they  imply  affect  the  optimizations  that  we  can  perform.  Chapters  3 
and  5  address  this  issue  further. 

2.3      Related  Work  ' 

This  section  describes  related  work  to  this  thesis  and  is  broken  into  two  sections. 
The  matericil  presented  in  the  first  section  forms  the  basis  for  our  work.  The 
last  section  describes  a  proposed  solution  to  the  problem  of  obtaining  loop-level 
parcdlehsm  developed  independently  by  Dritz  [DriQOb]. 

2.3.1      The  NYU  Ada/Ed  System 

The  basis  for  our  work  is  the  NYU  Ada/Ed  system,  a  translator/interpreter 
of  the  full  Ada  language.  The  NYUAda  project  began  as  an  experiment  in 
large-scale  prototyping  using  the  SETL  language  [SDDS86].    The  goal  of  the 
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experiment  was  to  construct  an  Ada  compiler.  Due  to  the  flexibility  that  high- 
level  programming  in  SETL  provides,  the  compiler  was  rapidly  produced.  In 
1983,  the  Ada/Ed  system  became  the  first  validated  Ada  compiler,  in  addition 
to  becoming  the  de  facto  standard  operationeJ  definition  of  Ada  [Hum88]. 

The  next  goal  of  the  experiment  was  to  increase  system  performance  by 
translating  it  to  low-level  SETL  and  eventually  C  [KS84].  The  C  version  was 
developed  on  a  Vax  and  has  been  ported  to  a  variety  of  machines  including  Sun, 
AUiant,  and  the  PC. 

In  her  PhD  thesis,  Hummel  [Hum88]  described  a  highly  parallel  version 
of  the  Ada/Ed  run  time  system  free  of  serialization  points.  It  relies  on  the 
fetch_and_add  primitive  and  supports  Ada's  tasking  features  in  a  highly  paral- 
lel manner.  Her  work  has  been  incorporated  into  the  Ada/Ed-C  system  and 
has  been  implemented  on  the  NYU  Ultracomputer  [Beh90].  Although  Hum- 
mel's  work  avoids  implementing  Ada  tasks  as  heavy  weight  operating  system 
processes,  these  tasks  are  still  not  suitable  to  specify  loop-level  parallelism.  Our 
work  extends  the  work  of  Hummel  by  specifying  two  idioms  and  their  imple- 
mentations that  allow  efficient  loop-level  parallelism  to  be  realized. 

2.3.2      Dritz's  Work 

We  conclude  this  chapter  by  describing  a  suggested  solution  developed  indepen- 
dently by  Dritz  [DriQOb]  to  obtain  loop-level  parallelism  in  Ada.  In  this  work, 
Dritz  shows  how  several  important  tasking  paradigms  can  be  realized  in  Ada. 
He  describes  monitors,  barriers,  gates,  and  a  self-scheduling  for-loop.  The  latter 
paradigm  is  similar  in  spirit  to  the  gen.minitask  generic  discussed  in  Chapter  4 
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and  is  shown  in  Figure  2.5. 

A  typical  use  of  this  generic  would  require  an  instantiation  as  follows: 

procedure  self_schedullng_loop  is  new  self-scheduling. loops  (st,  loopJ5ody); 

where  st  corresponds  to  the  "common  baise  type  of  the  expressions  for  the  bounds 
[of  the  loop],  or  any  subtype  thereof  containing  both  of  the  bounds".  Declaring  st 
as  integer  will  certainly  suffice.  The  second  parameter,  loop_body,  corresponds 
to  a  procedure  that  contains  the  statements  in  the  body  of  the  loop.  After 
this  generic  has  been  instantiated,  a  parallel  loop  is  obtained  by  calling  the 
instantiated  procedure  in  the  following  manner: 

self_scheduling_loop(10,  1,  1000); 

Although  this  solution  is  quite  elegant,  it  differs  from  our  work  in  the  fol- 
lowing ways: 

1.  No  optimized  implementation  algorithm  has  been  devised  [DriQOa].  There- 
fore, worker  tasks  are  implemented  as  a  regular  Ada  task. 

2.  The  task  initialization  problem  has  not  been  overcome;  iterates  are  as- 
signed sequentially.  This  implies  that  overhead  associated  with  this  pzira- 
digm  is  similar  to  that  of  the  Ada  parallel  loop  equivalent  shown  in  Fig- 
ure 2.4. 

3.  There  is  no  mechanism  for  the  parent  thread  to  inspect  the  status  of  amy 
of  the  child  threads  that  execute  the  iterates. 

In  [DriQOb],  Dritz  specifies  an  enhanced  version  of  this  generic  that  reuses 
the  iteration  tasks  in  subsequent  instantiations.  This  facihty  is  desirable  if  the 
overhead  associated  with  the  initiation  of  worker  tasks  proves  to  be  a  bottleneck. 
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generic 

type  index.type  is  (<>);  --  any  discrete  subtype 
with  procedure  body.proc  (i  :  in  index.type); 

procedure  self_scheduling_loops  (num. tasks  :  in  positive; 
lb,  ub  :  index.type); 

procedure  self^cheduling. loops  (num.tasks  :  in  positive; 
lb,  ub  :  index.type)  is 
task  type  worker; 

workers  :  array  (1  ..  num.tasks)  of  worker; 
task  loop.control  is 

entry  get_loop.status  (done  ;  out  boolean); 
entry  get.index.index.vaiue  (i  :  out  index.type); 
end; 

task  body  worker  is 
i  :   index.type; 
done  :  boolean; 
begin 
loop 

loop.control. get Joop^tat us  (done); 
exit  when  done; 
loop  control. getJndex. value  (i); 
body.proc  (i); 
end  loop; 
end; 

task  body  loop.control  is 
begin 

for  j  in  lb  .    ub  loop 

accept  getJoop-Status  (done  :  out  boolean)  do 

done  ;=  false; 
end; 
accept  getJndex.value  (i  :  out  index.type)  do 

i  :- j; 
end; 
for  k  in  1. num.tasks  loop 

accept  getJoop-status  (done  :  out  boolean)  do 

done  :=  true; 
end; 
end  loop; 
end; 
begin 

null, 
end; 

Figure  2.5:  Dritz's  Self.Scheduling-Loops 


Chapter  3 


Beacon  Tasks  and 
Fetch_And_Add 


This  chapter  presents  a  solution  to  the  task  initialization  problem  as  discussed 
in  Chapter  2.  This  solution  uses  a  generic  package,  gen.beacon,  to  make  the 
fetch_and-add  primitive  available  at  the  Ada  level.  By  making  this  primitive 
available  to  the  Ada  programmer,  we  not  only  solve  the  task  initialization  prob- 
lem, but  aJso  extend  the  possibilities  of  process  synchronization  in  Ada. 

Schonberg  and  Schonberg  [SS85]  first  introduced  this  idea  through  the  use 
of  the  beacon  task  type.  In  this  work,  we  specify  an  enhanced  version  of  the 
beacon  task.  Since  the  work  of  Schonberg  and  Schonberg  forms  the  foundation 
of  this  work,  we  first  describe  the  beacon  task  as  given  in  [SS85].  We  then 
show  how  encapsulating  this  ta^k  into  a  generic  package  helps  to  reduce  run- 
time overhead  and  storage.  This  generic  package  is  used  in  Chapter  4  where 
the  problem  of  tasking  overhead  is  addressed.  We  also  present  a  parcdlel  queue 
algorithm  that  utilizes  the  fetch_andjdd  primitive  [GGK'''83].    This  aJgorithm 
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has  been  rewritten  in  Ada  to  utilize  the  fetch_and-add  construct  we  suggest. 

In  specifying  a  fetch _and .add  construct  at  the  Ada  level,  we  impose  the  fol- 
lowing three  constraints: 

1.  The  fetch.and.add  construct  must  be  easy  to  use  syntactically  and  shoiild 
not  add  unnecessary  complexity  to  any  program  that  utilizes  it. 

2.  The  semantics  of  the  fetch.and.add  construct  must  be  consistent  with  those 
of  the  fetch.and.add  primitive.  This  enables  architectures  that  do  not 
support  the  fetch.and.add  primitive  to  execute  this  construct  with  the 
same  results,^  albeit  not  as  efficiently. 

3.  To  make  best  use  of  those  architectures  that  do  support  the  fetch.and.add 
primitive,  a  translation  of  the  fetch_and_add  construct  to  the  hardware 
primitive  must  be  available.  This  requires  that  a  compiler  targeted  for 
a  machine  that  supports  the  fetch.and.add  primitive  be  able  to  recognize 
and  exploit  uses  of  the  fetch.and.add  construct.  A  direct  translation  from 
the  fetch_and.add  construct  to  the  fetch.and.add  primitive  is  desirable. 

The  originaJ  beacon  task  satisfies  all  three  of  these  constrziints.    However, 
by  allowing  for  a  more  direct  translation  from  the  fetch.and.add  construct  to 
the  fetch  and  add  primitive,  the  new  gen  Jieacon  generic  package  satisfies  the  3' 
constraint  in  a  more  robust  manner.  In  order  to  understand  why  this  is  the  case, 
we  turn  our  attention  to  the  beacon  task. 


'Since  the  fetch.«nd.«dd  primitive  incorporates  non-determinism  into  its  semantics,  same 
rtsults  m»y  not  be  possible  even  on  the  same  architecture.  However,  we  expect  the  seman- 
tics of  fetch.and.add  construct  to  be  consistent  across  architectures,  regardless  of  whether  the 
fetch und.«dd  primitive  is  available. 
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3.1      The  Beacon  Task  Type 

In  this  section  we  present  the  beacon  task  type  as  originally  formulated  [SS85]. 
An  instance  of  a  beacon  task  provides  the  user  with  an  integer  variable  which 
may  be  used  in  a  shared  fashion.  In  addition  to  reading  and  writing  this  variable, 
an  operation  that  satisfies  the  fetch_and_add  semantics  is  permitted 

As  shown  in  Figure  3.1,  the  beacon  task  contains  an  integer  variable,  v. 
This  variable  corresponds  to  the  shared  variable  to  be  used.  Access  to  it  is 
provided  through  four  entry  calls.  In  addition  to  the  three  operations  mentioned, 
an  initialization  entry  call,  init,  is  also  provided.  In  order  to  ensure  that  v  is 
properly  initialized,  another  task  must  call  the  init  entry,  supplying  an  initial 
value.  Once  this  rendezvous  has  occurred,  the  beacon  task  accepts  any  of  the 
other  three  entry  calls:  read,  write,  and  F_andJ\. 

These  entry  calls  are  enclosed  in  a  select  statement.  The  semantics  of  this 
statement  specify  that  if  more  than  one  of  the  accept  statements  are  open, 
i.e.  a  task  has  called  their  corresponding  entry,  then  "one  of  them  is  selected 
arbitrarily"  (ARM  9.7.1(6)).  Since  the  select  statement  is  enclosed  by  a  loop, 
this  process  continues  until  no  more  active  callers  exist.  These  semantics  are 
precisely  the  way  our  target  machine  model  treats  simultaneous  reads,  writes, 
and  fetch.and.adds  -  as  if  they  occurred  in  some  unspecified  serial  order.  Thus, 
the  semantics  of  the  beacon  task  match  nicely  with  its  desirable  implementation. 

By  declaring  v  local  to  the  beacon  task,  access  to  it  is  restricted;  it  is  ordy 
possible  through  one  of  the  four  entry  Ccdls.  This  characteristic,  when  combined 
with  the  fact  that  the  beacon  task  can  oidy  be  executing  one  of  these  entries  at 
a  time,  ensures  that  atomic  access  to  this  variable  is  maintained. 
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task  type  beacon  is 

entry  init(e:  in  integer); 

entry  read(l;  out  integer); 

entry  write(e    in  integer); 

entry  F^nd_A(l:  out  integer;  e:  in  integer); 
end  beacon; 

task  body  beacon  is 

v;  integer; 
begin 

accept  init(e:  in  integer)  do 

V  :=  e; 
end  init; 
loop 
select 

accept  read(l    out  integer)  do 

I  :=  v; 
end  read; 
or 

accept  write(e;  in  integer)  do 

V  :=  e; 
end  write; 

or 

accept   F  jnd_A(l:  out  integer;  e:  in  integer)  do 
1  :=  v; 

V  :=  V  +  e; 
end  Fjnd_A; 

or 

terminate, 
end  select; 
end  loop; 
end  beacon; 

Figure  3.1:  Definition  of  the  beacon  Task  Type  as  in  [SSS5] 
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Of  particular  importance  is  the  semantics  of  the  F.and_A  entry.  As  mentioned 
in  Chapter  2.1,  the  semantics  of  the  fetch-and_add  primitive  specify  that  the 
variable  is  incremented  by  the  value  supplied  and  the  original  VcJue  returned. 
These  two  operations  must  appear  to  occur  atomically.  Since  access  to  the  local 
variable  is  monitored  by  the  beacon  task,  the  F_and_A  entry  ensures  that  these 
semantics  are  upheld. 

3.1.1      Transforming  the  Beacon  Task 

Due  to  the  manner  in  which  the  beacon  task  is  constructed,  it  can  be  transformed 
into  a  passive  data  structure.  On  a  machine  that  supports  the  fetch.and.add 
primitive,  reads,  writes,  and  F_and_A's  are  performed  atomically.  Hence,  we 
would  like  to  replace  a  call  to  any  of  these  entries  with  the  actual  read,  write  or 
fetch_and_add  operation.^  As  described  in  [SS85],  the  "beacon  thus  disappears 
as  a  separate  task  entity,  and  leaves  behind  a  simple  data  structure  and  code 
fragments  to  access  it."  This  type  of  transformation  is  an  example  of  the  kind 
proposed  by  Hilfinger  [Hil82a,Hil82b,Hil90]. 

Note  that  even  on  a  machine  that  does  not  support  the  fetch_and.add  prim- 
itive, a  transformation  of  this  type  Ccin  still  be  performed.  However,  synchro- 
nization must  be  provided  to  ensure  that  the  two  instructions  that  comprise  the 
F.and_A  rendezvous  are  both  executed  before  any  other  task  is  permitted  access 
to  V.  A  simple  semaphore  is  sufficient  to  provide  this  synchronization. 

Although  it  would  appear  that  each  CcJl  to  the  beacon  task  can  be  directly 
translated  into  the  corresponding  hardware  operation,  this  is  not  the  case.  Due 


^Initialization  is  viewed  as  a  form  of  a  write  operation. 
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to  Ada  tasking  semantics,  each  of  these  operations  must  be  preceded  by  a  check 
of  some  status  variables  associated  with  the  transformed  beacon  task.  These 
variables  are  used  to  coordinate  beacon  initialization,  as  well  as  ensuring  that 
an  abnormal  beacon  is  not  accessed.  They  are  contained  in  a  data  structure  aJong 
with  the  integer  value  associated  with  the  beacon  task.  The  data  structure  that 
represents  the  transformed  beacon  task  is: 

type  beacon_struct  is  record 

value  :   integer; 

initialized  :  boolean  :=  false; 

completed  :  boolean  :=  false; 

init-sync  :  integer  :=  0; 
end  record; 

The  value  field  corresponds  to  the  loczJ  variable,  v,  in  the  original  beacon  task. 
Two  boolean  fields,  initialized  and  completed  hold  information  on  the  callabihty 
of  the  beacon  task.  A  beacon  task  is  initialized  once  some  other  task  provides 
an  initial  value  via  the  init  entry  call.  A  beacon  task  becomes  completed  when 
it  is  aborted  or  when  the  task  that  declares  it,  its  master  task,  cind  all  of  its 
descendents  terminate.^.  If  any  of  these  tasks  try  to  access  the  beacon  task 
when  it  is  completed,  TASKING_ERROR  is  raised  in  the  calling  task.  The  integer, 
init  sync,  is  used  to  coordinate  multiple  initicdization  attempts. 

We  now  discuss  the  code  fragments  that  a  calhng  task  executes  in  place  of 
the  entry  cails  to  a  beacon  task  T. 


'We  discuss  task  dependence  in  more  detail  in  Chapter  4 
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if  T. completed  then 

raise  TASKING_ERROR; 
end  if; 

num  :=  /etc/i_and_add(T.init_sync,  1); 
if  num  =  0  then 
T.value  :—  e; 
T. initialized  :=  true; 
else  —  task  T  has  already  been  initialized. 

case  type  of  entry  call  is 

when  timed  =>  wait  for  the  stated  time 

when  unconditional  =>  block  until  calling  task  is  aborted 

when  conditional  => 

execute  statements  associated  with  else  part 
end  case; 
end  if; 

Figure  3.2:  Code  Fragment  for  T.init(e) 

The  Init  Entry  Call 

Consider  the  code  fragment  executed  when  the  init  entry  is  called  (Figure  3.2). 
First,  a  check  is  performed  to  ensure  that  T  is  not  marked  completed;  it  has 
not  completed  its  execution  or  been  aborted.  Ada  Semantics  specify  that  both 
"the  call  of  an  entry  of  an  abnormal  task"  (ARM  9.10(7))  and  "an  attempt  to 
call  an  entry  of  a  task  that  has  completed  its  execution"  (ARM  9.5(16))  result 
in  TASKING_ERROR  being  raised  at  the  place  of  the  call.  Thus  if  T. completed  is 
true,  TASKING_ERROR  is  raised  in  the  Ccdling  ta^k.  This  precludes  the  execution 
of  the  rest  of  the  code  fragment. 

If  T. completed  is  still  fcdse,  an  attempt  is  made  to  mimic  the  init  rendezvous. 
To  do  tbis  we  must  ensure  that  only  one  of  several  potential  callers  is  allowed 
to  initialize  v.  This  is  done  by  having  each  of  the  potential  initializers  execute 
a  fetch.and.add  operation  with  increment  of  one  on  the  init_sync  field.  Only  one 
of  these  tasks  will  have  the  original  value  of  zero  returned.  The  calUng  task  that 
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receives  this  vaJue  is  designated  as  the  "winner"  and  is  allowed  to  set  the  value 
field  to  the  value  passed.  The  initialized  flag  is  then  set  to  true.  By  executing 
a  fetch_and_add  on  the  init_sync  field,  we  ensure  that  only  one  task  is  allowed  to 
initicdize  the  value  field,  avoiding  a  race  condition. 

If  a  non-zero  value  is  returned  from  the  fetch.and^dd  operation,  then  another 
task  has,  or  is  about  to,  initialize  the  value  field.  When  this  occurs  the  action 
performed  is  dependent  on  the  type  of  init  entry  call  that  was  executed.  If  the 
c2lI1  was  a  timed  entry  call,  the  calling  task  would  normally  wait  the  specified 
amount  of  time  before  proceeding  with  the  statement  immediately  following  the 
call.  To  uphold  these  semantics,  the  calling  task  delays  the  appropriate  amount 
of  time  and  then  exits  the  code  fragment  for  init. 

When  a  conditional  entry  call  to  init  is  unsuccessful  -  another  task  has  al- 
ready been  selected  as  the  initializer  -  the  calling  task  executes  the  else  part 
associated  with  the  entry  call.  Thus,  when  a  non-zero  value  is  returned  from 
the  fetch. and.add  of  init_sync,  the  else  part  of  the  entry  call  is  executed. 

If  the  entry  call  is  unconditional,  then  Ada  semantics  specify  that  the  calling 
task  should  wait  until  this  Ccill  can  be  satisfied.  However,  the  beacon  task  is 
constructed  so  that  it  only  accepts  one  init  entry  call;  a  subsequent  rendezvous 
can  never  occur.  Thus,  if  a  t«isk  that  executes  an  unconditioned  entry  call  receives 
a  non-zero  value,  it  is  blocked  until  it  is  aborted.  This  program  most  likely 
contains  a  logic  error. 

The  Read,  Write,  and  F_and^  Entry  Calls 

The  code  fragments  for  the  read,  write,  and  F_and_A  entry  calls  arc  similar.  We 
present  each  fragment  according  to  the  type  of  entry  call.    Before  idlowing  a 
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while  not  T. initialized  or  T. completed  loop 

if  T. completed  then 

raise  TASKING_ERROR; 

end  if; 
end  loop; 
case  type  of  operation  is 

when  read  =>  var  :=  T. value; 

when  write  =>  T. value  :=  val; 

when  F-and_A  =>  val  :=  f  etch -and -add  [J. \/a\uc,  inc); 
end  case; 

Figure  3.3:  Code  Fragment  for  an  Unconditional  Entry  CaJl 

calling  task  access  to  the  beacon  variable,  check  must  be  performed  to  ensure 
that  the  beacon  task  has  been  initialized  and  has  not  been  aborted.  We  describe 
below  the  manner  in  which  this  is  done. 

The  code  fragment  for  an  unconditional  entry  call  is  given  in  Figure  3.3.  If  T 
is  yet  to  be  initicJized  and  is  not  marked  completed,  the  calling  task  loops  until 
one  of  these  two  conditions  change.  If  T  becomes  completed,  i.e.  another  task 
aborts  it,  TASKING_ERROR  is  raised  in  the  calling  task.  This  corresponds  to  the 
beacon  task  being  aborted  before  it  reaches  one  of  the  accept  statements  con- 
tained in  the  select  statement.  As  is  the  case  with  the  init  code  fragment,  raising 
TASKING-ERROR  precludes  execution  of  the  rest  of  the  code  fragment.  However, 
if  T  becomes  initialized,  before  it  is  completed,  the  value  field  is  accessed  in  the 
appropriate  fashion  (read,  write,  or  fetch_and.add). 

If  the  entry  call  is  of  the  conditional  variety,  then  the  code  fragment  described 
in  Figure  3.4  is  executed.  Since  a  conditioned  entry  call  attempts  to  perform  a 
rendezvous  immediately,  a  loop  is  no  longer  needed;  we  first  check  if  the  task  is 
completed,  raising  TASKING  JERROR  if  it  has.  If  T  is  not  completed,  a  check  is 
made  to  see  if  the  variable  has  been  initialized.  If  T  has  been  properly  initialized, 
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if  T. completed  then 

raise  TASKlNG_ERROR; 
end  if 

if  T. initialized  then 
case  type  of  operation  is 

when  read  =>  var  :=  T. value; 

when  write  =>  T. value  :=  val; 

when  F-andJ\  =>  val  :=  /e<c/i_anrf_add(T. value,  inc); 
end  case; 

Execute  any  other  statements  associated 

—  with  the  conditional  part  of  the  select  statement, 
else 

—  Execute  the  statements  associated  with 

—  the  else  part  of  the  select  statement. 
end  if; 

Figure  3.4:  Code  Fragment  for  a  Conditional  Entry  Call 

the  appropriate  operation  is  performed  followed  by  any  other  statements  present 
in  the  select  option  of  this  entry  caJl.  If  T  is  not  initialized,  then  the  rendezvous 
does  not  occur;  the  statement(s)  conteiined  in  the  else  part  of  the  select  statement 
are  executed. 

The  code  fragment  for  Umed  entry  calls  utilizes  the  methods  used  in  both 
the  simple  entry  call  and  the  conditional  entry  call.  It  is  given  in  Figure  3.5.  A 
timed  entry  call  specifies  a  delay  D  which  is  the  maximum  amount  of  time  the 
calling  task  is  willing  to  wait  for  the  receiving  task  to  reach  its  accept  statment 
(ARM  9.7.3(1)).  Thus,  a  timed  entr>'  call  does  not  necessarily  require  that  the 
rendezvous  occurs  immediately.  Once  agziin.  a  loop  is  employed  to  check  the 
initialized  and  completed  status  variables. 

Once  initialization  is  completed,  a  check  must  be  performed  to  ensure  that  D 
time  has  not  passed.  If  D  time  has  passed  since  execution  of  the  code  fragment 
begun,  the  rendezvous  does  not  occur;  the  statements  associated  with  the  delay 
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DT  ■-  CLOCK  +  D: 

while  not  T. initialized  or  T. completed  loop 
if  T  completed  then 

raise  TASKlNG_ERROR; 
end  if; 

exit  when  CLOCK  >  DT; 
end  loop; 

if  CLOCK  <  DT  then 
case  type  of  operation  is 

when  read  =>  var  :=  T. value; 
when  write  =>  T.value  :=  val; 

when  Fjnd_A  =>  val  :=  /etc/i_an</_ad<f  (T.value,  inc); 
end  case; 
Execute  any  other  statements  associated 

—  with  the  entry  call  part  of  the  select  statement. 
else 

—  Execute  the  statements  associated  with 
the  delay  part  of  the  select  statement. 

end  if; 

Figure  3.5:  Code  Fragment  for  a  Timed  Entry  Call  with  delay  D 
part  of  the  select  statement  axe  executed. 


3.1.2     Handling  Aborts  : 

Since  the  beacon  task  is  originally  an  Ada  task,  it  may  be  aborted  by  ajiy 
task  that  can  name  it.  Although  the  abort  statement  "shoiild  be  used  only 
in  extremely  severe  situations"  (ARM  9.10(10)),  care  must  be  taken  to  ensure 
that  proper  Ada  semantics  are  upheld  when  beacon  tasks  are  abstracted  away. 
Therefore,  we  must  ensure  that  zmy  transformation  of  the  beacon  task  properly 
handles  any  attempts  to  abort  it.  Additionally,  we  must  show  that  if  any  task 
that  can  call  one  of  the  beacon  task's  entries  is  aborted,  the  integrity  of  the 
trcinsformed  beacon  task  is  maintained. 
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When  a  beacon  tcisk  is  aborted,  the  completed  field  is  set  to  true.  As  we 
have  seen  by  the  code  fragments,  any  task  that  calls  an  entry  of  this  aborted 
task,  will  find  the  completed  field  set  to  true.  The  caUing  task  will  then  take  the 
appropriate  action  depending  on  the  type  of  entry  call. 

When  a  calUng  task  is  aborted,  the  transformed  beacon  task  is  not  affected: 
the  beacon  data  structure  still  exists  for  other  ta^ks  to  access.  These  semantics 
are  consistent  with  those  of  the  untransformed  beacon  task. 

Since  a  task  that  is  aborted  need  not  terminate  until  it  reaches  a  synchro- 
nization point  (ARM  9.10(6)),  the  body  of  the  code  fragments  may  be  executed 
in  its  entirety.  However,  since  completing  the  code  fragment  represents  the  end 
of  the  entry  call,  and  hence  a  synchronization  point,  an  aborted  task  must  cease 
execution  at  this  time. 

3.2      Packaging  a  Beacon  Task 

Recall  that  one  moti\-ation  behind  creating  the  beacon  task  is  to  make  the 
fetch. and_add  primitive  a\'ailable  at  the  Ada  level.  To  this  extent,  the  beacon 
task  achieves  this  goal.  However,  by  introducing  am  active  object  (a  task)  to 
represent  a  passive  one  (a  shared  \-ariable).  the  semantics  of  Ada  have  neces- 
sitated the  inclusion  of  additional  status  x'ariables.  These  status  \-ariables  are 
undesirable:  we  desire  a  direct  translation  of  uses  of  the  fetch .and^dd  idiom  to 
actual  fetch.and^dd  operations. 

Since  some  of  these  status  checks  are  required  by  the  semantics  of  task  com- 
pletion and  abort  instructions,  one  may  try  to  specify  a  beacon  task  in  a  rasinner 
such  that  these  issues  are  not  of  concern;  we  want  a  beacon  task  that  is  not  part 
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package  beacon.pack  is 

type  sharedJnteger  is  access  integer; 

function  read(var.  in  sharedJnteger)  return  integer); 

procedure  write(var:  in  out  sharedJnteger;  val:  in  integer; 

function  faa(var;  in  out  sharedJnteger;  inc:  in  integer)  return  integer; 
end  beacon.pack; 

package  body  beacon.pack  is 

type  sharedJnteger  is  access  integer; 
task  type  beacon  is 

entry  read-var(var:  in  sharedJnteger;  val  out  integer); 
entry  write.var(var;  out  sharedJnteger;  val:  in  integer); 
entry  F.and_A(var:  in  out  shared.integer; 
inc:  in  integer;  val:  out  integer); 
end  beacon; 

beacon.task  :  beacon; 

function  read(var:  in  shared.integer)  return  integer  is 

tmp  :  integer; 
begin 

beacon.task  read.var(var,  tmp); 

return  tmp; 
end  read; 

procedure  write(var;  in  out  sharedJnteger;  val:  in  integer)  is 
begin 

beacon.task. write.var(var,  val); 
end  write; 

function  faa(var:  in  out  sharedJnteger;  inc:  in  integer)  return  integer  is 

tmp  :  integer; 
begin 

beacon.task. F.and.A(var,  inc,  tmp); 

return  tmp; 
end  faa; 

end  beacon.pack; 

Figure  3.6:  Body  of  Beacon  Package  as  in  [HumSS] 
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Since  some  of  these  status  checks  are  required  by  the  semantics  of  task  com- 
pletion and  abort  instructions,  one  may  try  to  specify  a  beacon  task  in  a  manner 
such  that  these  issues  are  not  of  concern;  we  want  a  beacon  task  that  is  not  part 
of  a  task  dependence  relation  with  its  calhng  tasks,  and  also  can  not  be  aborted 
by  its  callers. 

In  an  approach  suggested  by  Hummel  [Hum88],  the  beacon  task  is  hidden 
inside  the  private  part  of  a  library  package  body  (see  Figure  3.6).  Access  to  its 
entries  is  only  possible  by  procedure  ccills  visible  to  the  subprogram  using  this 
package. 

This  approach  offers  the  following  advantages: 

•  Since  the  task  is  hidden  in  the  private  part  of  the  Hbrary  package,  no 
subprogram  that  uses  it  can  name  it,  and  therefore  abort  it  directly.'' 

•  Since  the  task  is  declared  in  a  library  package,  the  subprogram  that  uses 
this  package  does  not  need  to  wjiit  for  this  task  to  complete  before  it  can 
terminate  (ARM  9.4(13)). 

A  direct  result  from  these  two  observations  is  that  we  no  longer  need  to  store  in- 
formation about  the  beacon  task's  status  (completed  or  not);  it  will  not  complete 
until  after  edl  tasks  that  can  potentisJly  call  it  have  terminated.  Furthermore, 
since  it  is  contciined  in  a  hbrary  package,  tasks  that  can  call  it  need  not  wait  for 
its  completion  in  order  to  terminate  (ARM  9.4(13)). 

However,  the  proposed  solution,  (Figure  3.7)  differs  from  the  original  beacon 
task  (Figure  3.1)  in  a  few  ways: 


*The  task  can  still  he  aborted  if  its  master  task  is  aborted.   However,  in  this  case  all  potential 
callers  will  also  be  aborted. 
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task  type  beacon  is 

entry  read.var(var:  in  sharedJnteger;  val  out  integer); 
entry  write.var(var:  out  shared.'mteger;  val:  in  integer); 
entry  F.and_A(var:  in  out  shared.integer; 
inc:   in  integer;  val:  out  integer); 
end  beacon; 

task  body  beacon  is 

v:  integer; 
begin 
loop 
select 

accept  read_var(var:  in  shared.integer;  val  out  integer)  do 

val  :=  var.all; 
end  read.var; 
or 

accept  write-var(var:  out  shared.integer;  val:  in  integer)  do 

var.all  :=  val; 
end  write.var; 
or 

accept  F.and_A(var:  in  out  shared.integer; 
inc:  in  integer;  val:  out  integer)  do 
val  :=  var.all; 
var.all  :=  var.all  +  inc; 
end  F_and_A; 
or 

terminate; 
end  select; 
end  loop; 
end  beacon; 

Figure  3.7:  Beacon  Task  Type  as  in  [Hum88] 
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•  The  beacon  package  version  no  longers  ensures  that  the  local  variable  as- 
sociated  with  the  beacon  teisk  has  been  properly  initiahzed.  Thus,  caJls  to 
read  and  F.and_A  may  return  unknown  results. 

•  The  beacon  task  contained  in  the  beacon  package  no  longer  contaiins  a  local 
variable.  Thus,  all  entry  czJls  to  this  task  must  supply  an  access  variable 
that  is  to  be  modified  or  read.  This  gives  the  appearance  that  operations 
on  two  different  fetch ^and.add  variables  will  occur  in  a  sequential  manner. 
Although  this  is  true  of  beacon  package  at  the  Ada  level,  a  compiler  can 
removal  this  potential  bottleneck. 

Consider  two  tasks,  both  of  which  are  to  perform  a  fetch.and.add  opera- 
tion on  distinct  variables.  Since  both  tasks  subsequently  call  the  F_and_A 
rendezvous,  Ada  semantics  specify  that  these  rendezvous  must  occur  se- 
quentially. However,  since  the  two  tasks  operate  asynchronously,  the  order 
in  which  these  calls  are  made  is  not  known. 

Now  consider  when  these  entry  calls  are  translated  directly  to  the  fetch_- 
and.add  hardware  primitive.  Instead  of  engaging  in  a  rendezvous,  each 
task  now  performs  a  fetch.and_add  operation.  Since  the  programmer  is 
unable  to  detect  the  order  in  which  these  operations  occur,  they  can  be 
executed  concurrently. 

Thus,  although  an  undesirable  bottleneck  is  present  at  the  Ada  level,  it  is 
removed  when  these  entry  calls  are  transformed.  The  underlying  hardware 
decides  in  what  order  the  operations  occur. 

The  second  difference  is  transparent  to  the  programmer  and  therefore  not  an 
issue.  However,  tlir  first  difTcrcnce  is  one  the  programmer  must  be  aware  of.  A 
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solution  to  this  problem,  given  by  Mitsolides  [Mit88],  would  be  to  provide  the 
user  of  the  beacon  package  with  another  procedure  to  allow  initialization  (and 
creation)  of  new  fetch_and_add  variables.  Using  this  approach,  the  structure  that 
is  used  to  represent  a  transformed  beacon  task  is  reduced  to  the  following: 

type  beacon_struct  is  record 
value  :  integer; 
initialized  :  boolean  :=  false; 
init_sync  :  integer  :=  0; 
end  record; 
The  code  fragments  are  hkewise  reduced  by  ehminating  all  checks  of  the  elimi- 
nated completed  field. 

3.3      Our  Construct:  The  Generic  Beacon  Pack- 
age 

Although  we  have  reduced  the  run  time  overhead  to  calls  of  read,  write  and 
F_and_A  by  encorporating  the  beacon  task  in  a  hbrary  package,  Ada  semantics 
still  requires  us  to  inspect  the  initialized  field  before  access  to  the  beacon  variable 
can  occur.  Furthermore,  two  variables  are  stiU  required  for  the  management  of 
initialization. 

In  an  attempt  to  eliminate  these  variables  and  the  undesirable  overhead 
that  is  required  to  inspect  them,  while  still  providing  means  to  ensure  proper 
initialization,  we  introduce  the  generic  beacon  package,  gen_beacon  (see  Figures 
3.8  and  3.9).  This  generic  package  captures  aJl  of  the  advantages  of  the  beacon 
task  and  package,  while  also  solving  the  problem  of  beacon  initialization. 
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generic 

initial.value  :  integer; 

package  gen.beacon  is 

function  read  return  integer; 

procedure  write(val:  in  integer); 

function  faa(inc:  in  integer)  return  integer; 
end  gen.beacon; 

package  body  gen_beacon  is 
task  type  beacon  is 

entry  read.var(val  out  integer); 

entry  write-var{val:  in  integer); 

entry  F.and_A(val:  out  integer;  inc:  in  integer); 
end  beacon; 

beacon.task  :  beacon; 

function  read  return  integer  is 

tmp  :  integer; 
begin 

beacon. task.  read. var(tnnp); 

return  tmp; 
end  read; 

procedure  write(val:  in  integer)  is 
begin 

beacon.task.write.var(val); 
end  write; 

function  faa(inc:  in  integer)  retiu-n  integer  is 

tmp  :  integer; 
begin 

beacon. task. F.and.A{tmp,  inc); 

return  tmp; 
end  faa; 

task  body  beacon  is  . . . 

—  See  Figure  3.9 

end  beacon; 
end  gen.beacon; 

Figure  3.8:  The  Generic  Beacon  Package 
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As  seen  in  Figure  3.8,  the  generic  beacon  package  overcomes  the  problem  of 
initialization  by  having  an  initial  value  provided  as  the  generic  parameter.  Since 
this  initicJ  value  is  supplied  before  the  beacon  task  is  activated,  we  not  only  en- 
sure that  an  initial  value  is  provided,  but  also  ehminate  the  need  to  coordinate 
initialization  attempts  by  competing  tasks.  This  fact,  combined  vrith  the  "pack- 
aging" of  the  beacon  task,  reduces  the  representative  beacon  data  structure  to 
a  single  integer  variable:  an  integer  that  holds  the  value  of  the  fetch_and_add 
variable.  Thus,  the  data  structure  to  manage  a  beacon  task  becomes: 

type  gen.beacon_struct  is  record 

value  :  integer; 
end  record; 

An  instantiation  of  gen  Jjeacon  provides  three  visible  subprograms:  the  func- 
tions read  and  faa,  and  the  procedure  write.  Both  read  and  faa  return  integer 
values,  while  write  accepts  an  in  parameter  of  the  type  integer.  Each  of  the  visible 
subprograms  of  gen.beacon  calls  the  appropriate  entry  of  the  hidden  beacon_task 
(see  Figure  3.8). 

3.3.1      Implementation 

In  this  section  we  describe  the  Ada  semantics  of  gen_beacon  and  show  how  under 
these  semantics  an  efficient  implementation  can  be  realized.  Below  is  an  example 
of  instantiation  of  the  gen_beacon  package. 

my -beacon  is  new  gen_beacon(l); 

The  instantiation  of  a  generic  package  has  the  same  effect  as  specifying  the 
package  in  the  place  where  the  generic  is  instantiated,  substituting  all  references 
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task  body  beacon  is 

v:  integer  :=  initial.value; 
begin 
loop 

select 

accept  read.var(val  out  integer)  do 

val  :=  v; 
end  read-var; 
or 

accept  write.var(val:  in  integer)  do 

V  :=  val; 
end  write_var; 

or 

accept  F.and_A(val:  out  integer;  inc:  in  integer)  do 
val  :=  v; 

V  :=  V  +  inc; 
end  F.and^; 

or 

terminate; 
end  select; 
end  loop; 
end  beacon; 

Figure  3.9:  Beacon  Task  Body  for  the  Generic  Beacon  Package 
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to  generic  parameters  with  the  actuals  (ARM  12.3).  Once  the  parcimeters  of 
the  generic  have  been  evaluated,  the  elaboration  of  the  instantiated  package  can 
begin. 

We  now  consider  the  body  of  the  gen-beacon  package.  All  entities  declared 
here  are  not  visible  to  the  instantiator.  Elaborating  a  package  body  corresponds 
to  elaborating  its  declarative  part  (ARM  7.3(2)).  This  corresponds  to  elaborat- 
ing the  task  type  definition,  creating  the  task  object,  and  then  elaborating  the 
subprograms  and  beacon  task  body.  Since  the  task  type  and  bodies  of  the  all  of 
these  entities  are  known,  no  action  is  required  to  elaborate  them  (ARM  6.3(5)). 
Since  the  beacon  task  is  transformed  into  a  passive  object,  creating  it  corre- 
sponds to  allocating  space  for  the  beacon_struct  record. 

After  elaborating  the  declarative  part  of  the  package  body,  the  associated  se- 
quence of  statements  is  executed  (ARM  7.3(2)).  The  genJDeacon  package  does  not 
contain  a  sequence  of  statements.  Thus,  instantiating  gen.beacon  corresponds 
to  allocating  space  for  the  integer  variable  which  holds  the  beacon  variable. 

The  read,  write,  and  faa  Subprograms 

Since  the  beacon  task  is  no  longer  visible  to  its  user,  access  to  the  integer  variable 
it  monitors  must  be  provided  by  some  other  means.  This  is  the  role  of  the 
three  visible  subprograms:  read,  write,  and  faa.  The  semantics  of  these  routines 
are  straightforward;  each  routine  calls  the  appropriate  entry  call  of  the  hidden 
beacon  task. 

As  was  the  case  when  transforming  the  original  beacon  task's  entry  calls. 
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these  subprogram  ceJIs,  and  their  corresponding  entry  calls,  are  dso  trans- 
formed. For  example,  a  call  to  the  function  read  which  formerly  did  the  follow- 
ing: 

1.  call  the  read_var  entry, 

2.  access  the  monitor  variable, 

3.  return  its  value  to  the  read  function, 

4.  return  this  value  to  its  caller 

is  replaced  by  a  single  hardware  read  instruction.  The  write  procedure  is  trans- 
formed in  a  similar  fashion. 

Consider  the  faa  procedure.  As  is  the  case  with  the  read  and  write  subpro- 
grams, the  procedure  and  entry  calls  can  be  ehminated.  On  an  architecture  that 
supports  the  fetch_and-add  hardware  primitive,  the  faa  procedure  is  transformed 
exactly  as  is  the  case  with  read  and  write.  However,  even  on  machines  that  do 
not  possess  this  primitive,  this  type  of  transformation  can  still  be  performed. 
In  particular,  the  procedure  and  entry  calls  can  both  be  removed.  However, 
since  the  read  and  increment  operations  that  form  the  fetch.and.add,  cannot 
be  performed  atomicaJly,  semaphores  must  be  used  to  ensure  that  concurrent 
fetch. and_adds  do  not  interfere  with  each  other. 

Translated  subprogram  calls  to  read,  write,  and  faa  are  given  in  Table  3.1. 
Note  that  although  we  have  eliminated  the  querying  of  status  variables,  we  do 
need  to  flush  any  Ada  shared  variables^  associated  with  the  caJhng  task  that 


*An  Ada  shared  variable  is  a  variable  that  is  not  declared  by  a  pragma  as  shared,  but 
nevertheless  is  used  in  a  shared  nmnnei. 
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Original  Code 

Transformed  Code 

X  :=  my -beacon,  read; 

flush  current  task's  Ada  shared  variables 
X  :=  my -beacon. value; 

my -beacon. ■write(x); 

flush  current  task's  Ada  shared  variables 
my -beacon. value  :=  x; 

X  :=  my . beacon. faa(inc); 

flush  current  task's  Ada  shared  variables 
X  :=  /etc/i_an(f_ad(f  (my -beacon. value,inc); 

Table  3.1:  Code  Fragments  for  read,  write,  and  faa  Subprograms 


have  been  stored  locally.  This  observation  is  also  true  vrith  all  previous  variants 
of  the  beacon  task  and  was  first  attributed  to  Robert  Dewar  in  [Hum88].  The 
need  for  this  cache  flush  is  because  a  fetch-and-add-Hke  rendezvous,  as  with  any 
other  rendezvous,  is  a  synchronization  point  for  Ada  shared  variables.  This 
implies  that  any  local  copies  of  Ada  shared  variables  of  the  calling  task  must  be 
stored  into  global  memory  before  and  after  a  rendezvous  occurs  (ARM  9.11(7)). 
The  only  variable  used  in  a  shared  fashion  that  is  associated  with  the  gen- 
Jaeacon  package  is  v.  While  it  is  possible  to  keep  local  copies  of  this  variable  for 
each  task  that  has  access  to  it,  the  usage  of  this  variable  seems  to  suggest  that 
storing  it  in  the  shared  global  memory  would  be  a  better  choice.  Therefore,  there 
are  no  Ada  shared  variables  associated  with  the  gen_beacon  package.  Although 
the  gen-beacon  package  provides  a  means  to  obtain  an  integer  shared  variable, 
there  is  no  guarantee  that  the  user  will  not  employ  Ada  shared  variables  else- 
where. Thus,  each  read,  write,  and  fetch-and_add  is  preceded  by  a  flush  of  any 
of  the  user's  Ada  shared  variables. 
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generic 

queue  jize  :  integer; 

type  item  is  privBte; 
packfkge  paralleljqueue  is 

o\ERFLOW,  UNDERFLOW  :  exception; 

procedure  insert(data  :  item); 

function  delete  return  item; 
end  parallel.queue; 

package  body  parallel.queue  is 

q     array(0  .queue_size  -  1)  of  item; 
pragma  voLATlLE(q); 

package  nextJnsert  is  new  gen  J)eacon(O); 
package  next.delete  is  new  genJ)eacon(O); 
package  numojpper  is  new  gen_beacon(0), 
package  numJower  is  new  gen J)eacon(O); 

procedure  test-increment  j'eset(delta,  bound  :  in  integer)  is  . . . 
procedure  insert(data  ;  item)  is  . . . 

procedure  test.decrement  jeset(delta  :  in  integer)  is  . . . 
function  delete  return  item  is  . . . 
end  parallel.queue, 

Figure  3.10:  The  parallel.queue  Generic  Package 

is  used  to  solve  the  teisk  initialization  problem.  In  this  section  we  illustrate 
by  excLniple  how  a  "'fetch.and  jdd  algorithm"  would  appear  in  Ada.  when  the 
gen. beacon  construct  is  employed. 

The  algorithm  we  choose,  given  by  Gottlieb  ei  al.  in  [GGK*S3].  provides 
a  means  of  management  of  "highly  paxaJlel  queues".  Since  the  algorithm  was 
originally  described  as  a  collection  of  Pascal-like  routines,  it  seems  intuitive 
to  collect  these  routines  cind  their  data  structures  into  an  Ada  package.  This 
package,  shown  in  Figure  3.10.  is  constructed  in  a  generic  fashion  with  two 
peirameters:  queue_size  and  item,  corresponding  to  the  maximum  size  of  the 
parcdlel  queue  and  the  type  of  item  to  be  stored  in  the  queue. 
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package,  shown  in  Figure  3.10,  is  constructed  in  a  generic  fzishion  with  two 
parameters:  queue_si2e  and  item,  corresponding  to  the  maximum  size  of  the 
parallel  queue  and  the  type  of  item  to  be  stored  in  the  queue. 

Upon  instantiating  this  package,  two  subprograms  are  visible:  insert  and 
delete.  The  former  is  passed  a  parameter  of  type  item  to  be  inserted  at  the 
back  of  the  queue.  The  latter  returns  an  item  after  removing  it  from  the  front 
of  the  queue.  If  an  attempt  is  made  to  insert  into  a  full  queue,  the  exception, 
OVERFLOW  is  raised.  Similarly,  an  attempt  to  delete  an  item  from  an  empty 
queue  will  raise  the  exception,  UNDERFLOW.  The  insert  and  delete  routines  are 
specified  so  that  these  operations  are  performed  in  a  "highly  parallel"  manner. 

In  the  package  body,  we  represent  a  queue  of  length,  queue_si2e,  by  a  cir- 
cular array.  Since  concurrent  inserts  and  deletes  are  desired,  q  is  used  in  a 
shared  manner.  Although  Ada  does  not  allow  composite  objects  to  be  de- 
clared pragma  shared  (ARM  9.11(10)),  others  feel  this  restriction  is  too  severe 
[Dew90,Shu87,Hum88].  Even  though  atomic  updates  to  the  entire  composite 
object  are  not  possible  on  most  machines,  atomic  updates  to  a  component  of 
the  object  are  possible,  and  therefore,  in  the  spirit  of  ARM  9.11(11).  In  the  work 
of  Hunamel  [HumSS],  a  new  pragma,  volatile,  is  introduced  as  a  means  of  mark- 
ing those  composite  objects  that  are  used  in  a  shared  fashion.  In  this  work,  we 
subscribe  to  this  usage. 

In  addition  to  the  shared  array,  q,  the  algorithm  utilizes  four  integer  shared 
variables.  Since  fetch_and_add  operations  are  performed  on  these  variables,  the 
genJieacon  construct  is  used  to  realize  them.  The  shared  variables  represented 
by  the  packages,  next.insert  and  next.delete,  correspond  to  the  front  and  rear 
of  the  queue,  respectively.    The  initial  value  passed  to  these  packages  is  zero. 
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procedure  testJncrement-reset(delta,  bound  :  in  integer)  b 

dummy  :  integer;  —  used  to  receive  unwanted  faa  values 
begin 

if  num_upper.read  +  delta  <=  bound  then  —  Any  room? 

if  num-upper  faa(delta)  <=  bound  then Try  to  claim  it. 

return  true; 

else num_upper  incremented  between  two  if  s 

dummy  :=  num_upper.faa(-delta); reset  num.upper 

return  false; 
end  if, 
else 

return  false; 
end  if; 
end  test-incrementjeset; 

procedure  insert(data  :  item)  is 

myjnsert,  dummy  :  integer; 
begin 

if  testJncrement.reset(l,  queuejize)  then  —  try  to  find  room 
myjnsert  :=  next. insert. faa(l)  mod  queuejize; 

Wait  turn  at  myjnsert 

Q(myjnsert)  :=  data; 

dummy  :=  num. lower. faa(l); insert  is  complete 

else  —  no  room  available 

raise  o\  ERFLOW; 
end  if, 
end  insert, 

Figure  3.11:  The  insert  and  testjncrement_reset  Routines 

respectively.  These  two  \-ariables  never  differ  by  more  than  the  number  of  ac- 
tive insertions  and  deletions  [GGK"''S3].  Both  counters  are  initialized  to  zero, 
indicating  an  empty  queue. 

Figure  3.11  specifies  the  insert  and  testJncrement.reset  routines.  Before  con- 
sidering thf  insert  procedure,  we  turn  our  attention  to  the  testJncrement.reset 
function.  The  goal  of  test  Jncrcment  jeset  is  to  gusirzmtee  and  reserve  space  for 
an  insertion  into  the  queue.    As  originally  specified  in  [GGK'*'S3:.  this  routine 
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Figure  3.11  specifies  the  insert  and  test_increment_reset  routines.  Before  con- 
sidering the  insert  procedure,  we  turn  our  attention  to  the  testJncrement.reset 
function.  The  goal  of  testJncrement.reset  is  to  guarantee  and  reserve  space  for 
an  insertion  into  the  queue.  As  originally  specified  in  [GGK"''83],  this  routine 
takes  three  parameters  as  input:  a  shared  variable,  a  delta  value,  and  an  up- 
per bound.  For  the  call  located  in  insert,  these  parameters  correspond  to  the 
numjjpper  counter,  1,  and  queue_size,  respectively.  If  the  upper  bound  reports 
that  no  space  exists,  then  test.increment_reset  returns  the  value  false  to  insert, 
■which  in  turn  raises  the  OVERFLOW  exception. 

If  test_increment_reset  returns  true,  we  are  not  only  guaranteed  that  space 
exists  somewhere  in  q  to  insert  the  new  item,  but  also  that  this  space  has  been 
reserved.  To  acquire  this  space,  a  fetch_and-add  is  performed  on  the  insertion 
index,  nextJnsert,  yielding  a  unique  value  which  is  stored  in  the  local  variable, 
myJnsert.  This  value  modiilo  queue^ize  is  the  index  of  q  where  the  item  is 
stored. 

It  is  possible  that  a  delete  operation  has  not  finished  removing  the  previous 
element  stored  at  this  index.  Therefore,  we  must  wait  until  we  are  assured  that 
this  element  is  free  to  use.®  Once  q(myjnsert)  is  no  longer  occupied,  data  is  in- 
serted and  num  Jower  is  incremented  atomicaUy  via  the  fetch_and_add  construct; 
this  signifies  that  the  element  has  been  inserted. 

The  delete  procedure,  shown  in  Figure  3.12,  is  constructed  in  a  similar  man- 
ner. It  also  utilizes  a  function  to  determine,  and  reserve,  an  item  to  be  removed 

®This  can  be  done  by  using  two  semaphores,  insert  and  delete,  for  each  element  of  the  array. 
Although  these  semaphores  can  be  represented  in  Ada,  we  omit  their  specifications  in  this 
example.  Before  an  insert  operation  is  performed,  a  P  operation  is  performed  on  the  insert 
semaphore.  The  corresponding  V  operation  is  performed  after  the  item  is  deleted.  Similar 
actions  are  taken  for  the  delete  semaphore.  For  more  details,  see  [GGK''"83]. 
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procedure  test-decrement. reset(delta  :  in  integer)  is 

dummy  :   integer; 
begin 

if  num Jower.read  -  delta  >=  0  then  —  anything  to  delete? 

if  numJower.faa(-delta)  >=  0  then try  to  claim  it 

return  true; 
else  —  numJower  decremented  between  ifs 

dummy  :=  numJower.faa(delta); reset  numJower 

return  false; 
end  if; 
else 

return  false; 
end  if; 
end  test_decrement-reset; 

function  delete  return  item  is 
my.delete,  dummy  :  integer; 
begin 

if  test.decrement-rest(l)  then  —  see  if  an  item  exists 
my.delete  :=  next.delete  faa(l)  mod  queue_size; 

Wait  turn  at  my.delete 

data  :=  Q(my.delete); 

dummy  :=  num-upper.faa(-l); delete  is  complete 

else  —  no  items  to  delete 

raise  INDERFLOW; 
end  if; 
end  delete; 

Figure  3.12:  The  delete  and  test.decrementjeset  Routines 
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an  attempt  to  claim  this  item.  Assuming  this  is  successful,  true  is  returned  to 
delete.  This  allows  delete  to  remove  this  item  ajid  decrement  the  num_ijpper 
counter. 

Both  inserts  and  deletes  can  be  performed  in  parallel.  Since  shared  \-ari- 
ables  axe  accessing  using  the  fetch_and_add  construct,  potential  bottlenecks  are 
avoided.  Thus,  the  insert  and  delete  routines  provide  a  mechanism  to  support 
highly  parallel  queues. 

3.4.1      An  Apparent  Shortcoming  of  GenJBeacon 

Consider  the  test_increment_reset  routine  shown  in  Figure  3.11.  As  previously 
mentioned,  the  original  testJncrement_reset  routine  has  three  parameters  as  in- 
put: a  shared  \-ariable,  a  delta  value,  and  an  upper  bound.  However,  in  Fig- 
ure 3.11  we  assume  the  first  parameter  is  always  num_upper.  While  this  is  the 
case  with  this  particular  algorithm,  one  would  hope  to  write  the  test.increment- 
.reset  procedure  in  a  more  general  manner. 

Intuitively,  one  would  expect  to  pass  the  shared  variable  by  reference  to  the 
test_increment_reset  procedure.  However,  by  using  an  instantiation  of  a  package 
to  represent  a  shared  variable,  we  lose  the  abihty  to  pass  references  of  this 
\-ariable:  packages  are  not  first  class  objects.  Thus,  it  appears  that  we  axe 
unable  to  specify  test  jncrement_reset  in  a  general  way. 

This,  however,  is  not  the  case.  As  Figure  3.13  illustrates.  testJncrement _reset 
cam  be  specified  as  a  generic  procedure.  Instead  of  passing  the  shared  \'ariable 
to  this  procedure,  we  provide  the  means  to  access  this  variable  by  passing  two 
functions.  The  functions  correspond  to  a  read  and  a  fetch_and_add  on  the  shared 
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unable  to  specify  testjncrement_reset  in  a  general  way. 

This,  however,  is  not  the  case.  As  Figure  3.13  iDustrates,  test_increment_reset 
can  be  specified  as  a  generic  procedure.  Instead  of  paissing  the  shared  variable 
to  this  procedure,  we  provide  the  means  to  access  this  variable  by  passing  two 
functions.  The  functions  correspond  to  a  read  and  a  fetch.and.add  on  the  shared 
variable  of  interest.  This  provides  the  test-increment_reset  procedure  a  means 
to  access  the  shared  variable  without  specifying  the  variable  itself.  Thus,  a 
subprogram  can  instantiate  this  generic  function,  and  later  call  it,  using  the  read 
and  fetch.and.add  routines  that  are  made  available  by  the  gen_beacon  package. 
The  test.decrement  jeset  routine  is  treated  in  a  similar  manner.  Figure  3.14  gives 
the  revised  insert  and  delete  routines  using  the  generic  functions  of  Figure  3.13. 
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generic 

with  function  read.beacon  return  integer; 
with  function  faaJ)eacon  return  integer; 
procedure  test_increment_reset(delta,  bound  :  in  integer)  is 

dummy  :  integer; 
begin 

if  readJjeacon  +  delta  <=  bound  then 
if  faa_beacon(delta)  <=  bound  then 

return  true; 
else 

dummy  :=  faa_beacon(-delta); 
return  false; 
end  if; 
else 

return  false; 
end  if; 
end  test_increment_reset; 

generic 

with  function  read.beacon  return  integer; 
with  function  faa_beacon  return  integer; 
procedure  test.decrement_reset(delta  :  in  integer)  is 

dummy  :  integer; 
begin 

if  read-beacon  -  delta  >=  0  then 
if  faaJ3eacon(-delta)  >=  0  then 

return  true; 
else 

dummy  :=  faaj3eacon(delta); 
return  false; 
end  if; 
else 

return  false; 
end  if; 
end  test-decrement.reset; 

Figure  3.13:  Generic  Version  of  test_increment.reset  and  test  .decrement  _reset 
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procedure  insert(data  :  item)  is 
myjnsert,  dummy  :  integer; 
function  num.upper.tir  is  new 

test-increment-reset(num_upper.read,  num_upper.faa); 

begin 

if  num_upper.test-increment.reset(l,  queue.size)  then 

myjnsert  :=  next-insert. faa(l)  mod  queue_size; 

Wait  turn  at  myjnsert 

Q(myjnsert)  :=  data; 

dummy  :=  numJower.faa(l); 
else 

raise  overflow; 
end  if; 
end  insert; 

function  delete  return  item  is 
my.delete,  dummy  :  integer; 
function  numJowerJdr  is  new 

test. decrement -reset(numJower. read,  num. lower. faa); 

begin 

if  num Jower Jdr(l)  then 

myjdelete  :=  next.delete  faa(l)  mod  Queue.Size; 

—  Wait  turn  at  my.delete 

data  :=  Q(myJelete), 

dummy  :=  num. upper. faa{-l), 
else 

raise  underflow; 
end  if, 
end  delete. 

Figure  3.14:  The  revised  insert  and  delete  Routines 


Chapter  4 


The  Gen_Minitask  Package 


A  significant  shortcoming  of  the  Ada  tasking  model  is  the  inabihty  to  specify 
loop-level  parallelism  in  an  efficient  manner.  This  shortcoming,  first  discussed 
in  Chapter  2,  impedes  the  use  of  Ada  as  a  language  for  expressing  a  large  class  of 
parallel  algorithms,  namely,  those  algorithms  that  utilize  loop-level  parallelism. 
In  this  chapter,  we  introduce  the  genjninitask  package  to  overcome  this  short- 
coming. This  package  is  similar  to  the  gen_beacon  package,  described  in  Chap- 
ter 3,  in  that  both  of  these  packages  supply  efficient  solutions  to  well  known  Ada 
shortcomings,  without  modifying  the  language.  In  order  to  understand  the  need 
for  this  package,  we  discuss  various  types  of  parallelism  before  illustrating  why 
Ada  is  lacking  in  the  loop-level  variety. 

4.1      Types  of  Parallelism 

An  important  characteristic  of  a  parallel  algorithm  is  the  manner  in  which  the 
parallelism  is  expressed.  Of  particular  significance  is  the  granularity  or  grain  size 

56 
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of  the  parallel  construct  that  is  used.  Although  granularity  has  been  defined  as 
"the  basic  size  of  a  process  chosen  for  parallehsm"  [Kri89],  synchronization  and 
communication  costs  must  also  be  considered. 

We  define  the  overhead  associated  with  a  parallel  construct  as  the  amount 
of  additionzJ  computation  that  is  reqmred  to  initiate  and  terminate  the  parallel 
threads  that  are  associated  with  the  construct.  Overhead  plays  a  significant 
role  in  determining  the  granularity  of  a  construct.  From  it  we  can  determine 
the  minimum  amount  of  computation  that  each  thread  must  execute  to  ensure 
that  the  expected  execution  time  is  reduced,  compared  to  sequential  execution. 

Granularity  can  be  divided  into  three  levels:^ 

coarse  or  procedure-level:  A  thread  at  this  level  of  granularity  can  have  up 
to  the  functional  equivalent  of  an  operating  system  process.  It  can  contain 
calls  to  subprograms,  access  globed  variables,  and  perform  synchronization 
and  communication  with  other  executing  threads.  However,  due  to  this 
robust  functionality  a  significant  amount  of  overhead  is  required  to  manage 
these  heavy-weight  threads.  Thus,  this  level  of  parcdlelism  is  only  utilized 
when  each  thread  requires  this  full  functionality,  and  is  going  to  execute  a 
significant  amount  of  computation  to  offset  the  corresponding  overhead. 

medium  or  loop-level:  This  form  of  parallelism  severely  hmits  the  amount 
of  communication  and  synchronization  that  a  thread  is  allowed.  Due  to 
this  limitation,  this  level  of  parallelism  typically  requires  less  overhead. 
Thus,  a  sequence  of  statements,  or  a  loop  iteration,  often  supplies  enough 
computation  to  olTset  this  overhead. 


'Although  most  peoplf  agifc  that  three  levels  of  par&llelisni  exist,  there  is  no  consensus  as 
to  what  these  three  levels  arc.  Sec  [Kri89j  for  an  alterative  definition. 
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Parallel  Do  i  =  1.  N 

End 

Figure  4.1:  A  Typical  Parallel  Loop 

fine  or  instruction-level:  At  this  level,  parallelism  is  obtained  by  expanding 
the  width  of  a  machine  instruction.  Severed  operations  are  executed  con- 
currently within  one  instruction  on  distinct  processing  units.  Machines 
that  support  this  form  of  paralleUsm  are  know  as  VLIW  (Very  Long  In- 
struction Word)  machines.  As  this  form  of  parallelism  is  directly  accepted 
by  the  machine,  no  overhead  is  required  to  initiate  a  parallel  thread.  The 
difficulty  hes  in  finding  enough  operations  that  can  be  performed  in  parallel 
by  one  instruction  [E1186,EN90]. 

Parallelism  in  Ada  is  of  the  coarse  grain  vaxiety.  The  Ada  task  is  a  "procedure- 
like" object  that  executes  in  parallel  with  other  active  tasks  in  the  program.  As 
described  in  Chapter  2,  the  rendezvous  mechanism  provides  a  manner  in  which 
synchronization  and  communication  can  be  performed  between  two  otherwise 
asynchronous  tasks.  Although  the  Ada  tasking  model  is  well-suited  for  those 
applications  that  utilize  coarse  grain  paralleHsm,  many  scientific  applications 
require  the  lower  overhead  associated  with  loop-level  parallelism. 

Figure  4.1  illustrates  a  typical  parallel  loop.  The  programimer  specifies  that 
all  N  iterations  can  be  executed  in  parallel.  Although  no  "parallel  loop"  con- 
struct exists  in  Ada,  one  Ccin  use  tasks  to  specify  a  semantically  equivalent  entity. 
Constructing  a  parallel  loop,  like  many  other  problems  in  parallel  processing, 
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requires  that  each  task  acquires  a  unique  identity.  In  the  case  of  a  parallel  loop, 
this  identity  is  used  as  the  iteration  value  of  the  loop.  As  chscussed  in  Chap- 
ter 2,  a  major  shortcoming  of  Ada  is  the  inability  to  distribute  task  identities  in 
parallel,  (the  task  initialization  problem). 

The  serial  bottleneck  it  implies  increases  the  amount  of  overhead  associated 
with  the  task  initialization.  Since  this  overhead  is  of  more  importance  with 
loop-level  parallehsm  than  with  a  more  coarse  grain  type,  this  shortcoming  is 
most  critical  in  the  construction  of  a  parallel  loop. 

In  Chapter  3  we  presented  a  solution,  the  gen.beacon  idiom,  to  this  problem 
using  the  fetch_and_add  primitive.  Although  this  idiom  is  used  in  subsequent 
sections  of  this  chapter,  we  first  describe  the  manner  in  which  a  parallel  loop 
can  be  constructed  without  using  the  gen. beacon  construct. 

Figure  4.2  shows  how  one  might  achieve  the  effects  of  a  parallel  loop  in  Ada. 
A  task  type,  iteration.task  is  declared;  it  corresponds  to  an  iteration  of  the  loop. 
This  task  contains  one  entry  point,  getJd,  which  is  used  to  receive  its  identity 
or  iteration  value.  This  is  followed  by  the  declaration  of  an  array  of  N  tasks, 
corresponding  to  the  N  iterations  of  the  loop. 

After  the  task  body  is  specified,  iteration  values  of  the  "parallel  loop"  are 
distributed  to  each  of  the  iteration.tasks  by  the  for  loop.  Iteration  :  of  this  loop 
calls  the  get.id  entry  of  the  i*^  task,  supplying  it  with  the  iteration  value  t. 
This  distribution  of  task  identities  is  done  sequentially,  forming  an  undesirable 
bottleneck  in  our  "parcdlel  loop". 

Even  though  we  have  suggested  an  idiom  to  remove  this  serial  bottleneck,  a 
source  of  inefficiency  still  exists;  loop-level  parallehsm  is  being  expressed  using 
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declare 

task  type  iteration.task  is 

entry  get-id(id     in  integer); 
end  iteration.task; 

parJoop  :  array  (IN)  of  iteration.task, 

task  body  iteration.task  is 

myjd      integer; 
begin 

accept  getJd(id  :  in  integer)  do 

myJd  :=  id, Acquire  an  id  from  parent. 

end  get.id. 

5, 

end  iteration.task, 

begin  —  All  "iterations"  begin  to  execute. 

for  i  in  IN  loop 

par.loop(i).getJd(i);  —  Identities  are  distributed  sequentially. 

end  loop, 
end,  —  Wait  for  all  "iteration"  tasks  to  terminate. 

Figure  4.2:  An  Ada  "Parallel  Loop" 

a  coarse  grained  construct,  the  Ada  task.  While  the  introduction  of  a  loop-level 
parallel  construct,  i.e.  a  parallel  loop,  would  solve  this  problem,  our  goal  is  not  to 
modify  Ada.  but  to  use  the  current  language  to  express  this  form  of  parallelism 
efficientl}-.  Before  suggesting  how  this  cein  be  achieved,  let  us  review  the  process 
of  task  initiation  in  Ada. 


4.2      Task  Initiation 

This  section  describes  the  maimer  in  which  tasks  are  initiated.  Tasks  can  be 
initiated  statically  or  dynamically.  Task  initiation  can  be  broken  into  two  phases: 
task  crtation  and  task  activation.  The  former  corresponds  to  when  the  ta^k 
object  is  created,  while  the  latter  is  when  the  task  elaborates  its  declarative 
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declare 

task  type  worker  is  . . . 

type  ref-worker  is  access  worker; 

one.task  ;  worker;  —  Create  a  task. 

ten.task  :  array  (1. 10)  of  worker;        —  Create  ten  tasks. 

refJask  :   ref.worker; 

task  body  worker  is  . . . 

begin        --  Activate  one.task  and  ten_task(!),  i  G  I.-IO 

ref-task  :=  new  ref.worker;        --  Create  k  activate  a  worker  task, 
end  --  Wait  for  all  tasks  to  terminate. 

Figure  4.3:  Task  Creation  and  Acti\-ation  Example 

part  and  begin  its  execution. 

To  illustrate  these  two  notions,  consider  the  block  statement  of  Figure  4.3. 
During  execution,  the  declairative  part  of  this  block  is  elaborated  sequentially 
(ARM  3.9(3));  followed  by  the  execution  of  the  sequence  of  statements  (ARM 
5.6(4)). 

After  elaborating  the  task  type,  worker,  and  the  access  type,  ref.worker.  the 
task  object,  one.task  is  encountered.  Elaborating  a  task  object  corresponds  to 
creating  the  task.  i.e.  allocating  its  storage,  making  it  known  to  the  system,  etc.. 
but  not  starting  its  execution.  The  array  of  tasks,  ten.tasks  is  treated  similarly. 
Elaborating  this  array  is  just  like  elaborating  any  other  array  object  declaration, 
and  proceeds  as  follows  (ARM  3.2.1(4-8)): 

1.  "...  the  constrained  array  is  first  elaborated." 

2.  '". . .  any  implicit  initial  values  for  the  object  . . .  are  e\-aluated." 

3.  "The  object  is  created." 
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4.   "Any  initial  value  is  assigned  to  the  object." 

Thus,  ten  tasks  are  created  and  are  associated  with  the  array,  ten_tasks.  The 
reference  manual  does  not  specify  how  these  tasks  are  created,  so  they  may  be 
created  sequentially,  in  parallel,  or  some  combination  of  the  two.  As  is  the  case 
with  one.task,  none  of  these  tasks  may  start  their  execution  at  this  time. 

Next,  the  access  object,  ref.task,  is  created.  As  with  any  access  type,  unless 
an  initial  value  is  provided  by  a  call  to  the  allocator,  no  object  is  created.  Thus, 
no  task  is  created  when  ref.task  is  elaborated. 

The  elaboration  of  the  declarative  region  is  complete  once  the  body  of  worker 
is  elaborated.  This  elaboration  "has  no  other  effect  than  to  establish  that  the 
body  can  from  then  on  be  used  for  the  execution  of  tasks  designated  by  objects 
of  the  corresponding  task  type"  (ARM  9.1(6)). 

Upon  conclusion  of  this  elaboration,  the  sequence  of  statements  is  ready  to 
be  executed.  However,  prior  to  executing  these  statements,  all  tasks  that  were 
created  in  the  declarative  part  are  activated  (ARM  9.3(2)).  Activation  of  a  task 
"consists  of  the  elaboration  of  the  declarative  part,  if  any,  of  the  task  body"  (ARM 
9.3(10)).  After  a  teisk  has  been  activated,  it  can  start  executing  its  sequence 
of  statements;  it  need  not  wait  for  any  other  task  to  complete  its  activation. 
Although  different  task  objects  are  elaborated  sequentially,  activation  of  their 
corresponding  tasks  proceeds  in  parallel  (ARM  9.3(1)). 

Once  all  tasks  have  been  successfully  activated  and  begun  their  execution, 
the  assignment  of  ref.task  is  encountered.  Since  ref.task  is  an  access  variable, 
the  object  it  accesses  is  not  created  until  the  allocator,  new,  is  called.  When 
this  transpires,  a  task  object  is  dynamically  created,  and  then  activated.   Once 
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a  successful  activation  has  occurred,  an  access  to  the  value  of  the  task  is  stored 
in  ref_task.  Execution  of  both  the  rest  of  the  sequence  of  statements  and  the 
newly  activated  task  proceed  in  parallel.  Upon  reaching  the  end  statement,  the 
task  that  is  executing  the  block  statement  must  wait  for  the  twelve  other  tasks 
to  terminate,  if  they  haven't  already  done  so,  before  proceeding. 

Note  that  an  initial  value  for  the  access  variable  ref.task  could  have  been 
provided  in  the  declarative  part  by  using  a  call  to  the  allocator.  If  this  were  the 
case,  a  task  would  be  created  and  then  immediately  activated. 

4.2.1      Task  Dependence 

The  reference  manual  defines  the  related  concepts  of  task  dependence  and  a 
master  of  a  task  (ARM  9.4(1-3)): 

"Each  task  depends  on  at  least  one  master.  A  master  is  a  construct 
that  is  either  a  task,  a  currently  executing  block  statement  or  sub- 
program, or  a  library  package.  The  dependence  on  a  master  is  a 
direct  dependence  in  the  following  two  cases: 

•  The  task  designated  by  a  task  object  that  is  the  object,  or  a  sub- 
component of  the  object,  created  by  the  evaduation  of  aji  allo- 
cator depends  on  the  master  that  elaborates  the  corresponding 
access  type  definition. 

•  The  task  designated  by  any  other  task  object  depends  on  the 
master  whose  execution  creates  the  task  object." 
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Once  again,  consider  Figure  4.3.  The  master  of  the  single  task,  one.task, 
and  the  array  of  tasks,  ten_tasks,  is  the  block  that  declares  it.  Likewise,  each 
of  these  eleven  tasks  depends  on  this  block.  Although  the  task  that  is  created 
dynamically  is  also  dependent  on  this  block,  it  is  for  a  different  reason;  namely, 
the  access  type  definition  is  declared  in  this  block.  This  is  consistent  with  normal 
scope  rules  for  objects  created  by  an  allocator. 

The  concept  of  task  dependence  is  useful  in  determining  the  tasks  a  block 
must  wait  for  before  exiting.  This  is  precisely  all  tasks  that  are  dependent  on 
that  block,  i.e. 

•  all  task  objects  that  were  statically  declared  in  that  block  statement,  and 

•  all  tasks  that  were  created  by  an  allocator  (possibly  in  some  inner  frame) 
using  an  access  type  definition  that  was  declared  in  this  block. 

Returning  to  the  block  statement  in  Figure  4.3,  we  see  that  the  task  execut- 
ing this  block  statement  must  Wciit  for  the  tasks  associated  with  one.task  and 
ten_tasks.  In  addition,  any  tasks  associated  with  the  access  type,  ref_worker, 
must  also  terminate  prior  to  the  completion  of  this  block. 

4.2.2      Tasks  and  Exceptions 

Since  exceptions  are  an  integral  part  of  Ada,  a  reasonable  question  to  ask  is: 

How  are  tasking  semantics  affected  by  the  presence  of  exceptions? 

The  reference  manual  defines  how  these  two  important  features  interact.  First, 
it  defines  what  happens  if  an  exception  is  raised  before  or  while  a  task  object  is 
elaborated: 
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"If  an  exception  is  raised  during  the  elaboration  of  the  declarative 
part  of  a  given  frame,  this  elaboration  is  abandoned."  ARM  11.4.2(1) 

Therefore,  task  objects  are  not  elaborated  when  the  elaboration  of  a  previous 
item  raises  an  exception. 

Next,  we  consider  the  scenario  when  an  exception  is  raised  after  a  task  object 
has  been  elaborated,  but  before  the  corresponding  task(s)  has  been  activated: 

"Should  aji  exception  be  raised  by  the  elaboration  of  a  declarative 
part  or  package  specification,  then  any  task  that  is  created  (directly 
or  indirectly)  by  this  elaboration  and  that  is  not  yet  activated  be- 
comes terminated  and  is  therefore  never  activated."  ARM  9.3(4) 

As  in  the  case  when  the  exception  is  raised  before  the  task  object  is  elaborated, 
no  tasks  are  activated. 

We  next  consider  the  case  when  no  exception  has  occurred,  and  a  task  object 
has  been  elaborated  successfully.  What  happens  if  an  exception  is  raised  during 
task  activation? 

"Should  an  exception  be  raised  by  the  activation  of  one  of  these  tasks, 
that  task  becomes  a  completed  task;  other  tasks  are  not  directly 
affected."  ARM  9.3(3) 

Thus,  if  an  exception  is  raised  during  the  activation  of  one  of  the  tasks  aissociated 
with  the  array  ten.tasks,  that  task  becomes  completed,  while  the  other  nine 
tasks,  in  addition  to  one  task,  arc  not  affected. 

How  does  an  exception  being  raised  during  the  activation  of  a  task  affect  the 
master  task? 
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"Should  one  of  these  tasks  thus  become  completed  during  its  ac- 
tivation, the  exception  TASKING_ERROR  is  raise  upon  conclusion  of 
the  activation  of  all  of  these  tasks  (whether  successfully  or  not);  the 
exception  is  raised  at  a  place  that  is  immediately  before  the  first 
statement  following  the  declarative  part  (immediately  after  the  re- 
served word  begin)."  ARM  9.3(3) 

This  quote  illustrates  that  the  constraint  that  the  master  must  wait  for  adl 
dependent  tasks  to  complete  their  activation  before  continuing  its  execution  is 
indeed  a  practical  one;  if  any  of  these  tasks  raise  an  exception  the  master  must 
raise  TASKING_ERROR  instead  of  continuing  its  normad  execution. 

What  happens  if  an  exception  is  raised  by  more  than  one  activating  task? 

"Should  several  of  these  tasks  thus  become  completed  during  their 
activation,  the  exception  TASKING_ERROR  is  raised  only  once."  ARM 
9.3(3) 

An  implementation  can  satisfy  this  rule  by  maintaining  a  boolean  for  the  master 
signifying  if  TASKING_ERROR  should  be  raised  upon  conclusion  of  the  all  of  the 
dependent's  activation. 

Notice  that  the  reference  manual  takes  a  different  view  when  a  task  raises 
an  exception  during  its  activation  as  opposed  to  when  it  is  being  created.  This 
is  because  activation  marks  the  beginning  of  a  task's  execution,  and  thus,  once 
this  has  begun,  abnormal  actions  in  it  should  not  affect  a  sibling  task. 

These  quotes  illustrate  that  the  initiation  of  a  task  is  a  two  stage  process. 
First,  the  task  objects  are  created,  and  then,  provided  no  exceptions  are  raised, 
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the  tasks  are  activated  and  begin  their  execution.  The  point  at  which  this  ac- 
tivation occurs  depends  on  whether  the  task  is  declared  staticcdly  or  zdlocated 
dynamically.  The  motivation  behind  this  is  that  if  a  problem  arose  creating  one 
or  more  tasks,  then  the  programmer  probably  does  not  want  "normal"  process- 
ing to  continue,  be  it  trying  to  create  more  tasks,  or  activating  the  ones  that 
already  have  been  created.  While  this  two  stage  process  is  beneficial  for  han- 
dhng  exceptions,  the  synchronization  point  it  mandates  reduces  the  usefulness 
of  Ada  in  specifying  loop-level  parallelism. 

Unfortunately,  this  is  not  the  only  reason  why  Ada  is  seemingly  unsuitable 
for  loop-level  parallelism. 

A  second  deficiency  in  the  Ada  tzisking  model  is  its  relative  coarseness,  as  it  is 
intended  primarily  for  progrcmiming  embedded  applications  with  relatively  small 
rate  of  intertask  communication.  As  has  been  often  noted,  the  tasking  model 
of  Ada  is  nevertheless  very  general  [Fra87,Hil82b,Blu81].  While  this  generality 
provides  the  programmer  with  a  powerful  mechanism  to  express  a  program's 
communication  and  synchronization  semantics,  the  overhead  it  requires  is  a 
hindrance  in  expressing  light-weight  paralleUsm. 

By  providing  a  sophisticated  method  of  task  communication  and  synchro- 
nization, an  Ada  implementation  must  maintain  additional  context  (informa- 
tion associated  with  rendezvous,  entry  queues,  task  priorities,  etc.).  Since  the 
full  generality  of  the  Ada  task  is  not  needed  to  specify  loop-level  parallelism  - 
parallel  loop  iterations  do  not  communicate  with  each  other  or  specify  different 
priorities  -  a  heavy  price  is  being  paid  for  something  that  is  not  required.  How- 
ever, if  a  compiler  can  ascertain  that  a  task  will  execute  in  a  light-weight  fashion 
(i.e.  with  no  synchronization  with  other  tasks),  then  it  can  implement  this  task 
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in  a  more  efficient  manner. 

In  order  to  be  able  to  express  loop-level  parallelism  in  Ada,  these  two  prob- 
lems must  be  overcome. 

4.3      Removing  the  Synchronization  Point 

As  we  have  just  seen  there  are  two  reasons  why  Ada  tasks  are  not  suitable  for 
expressing  loop-level  parallelism: 

•  the  synchronization  point  that  is  necessary  between  task  creation  and  task 
activation, 

•  the  significant  context  required  to  manage  an  Ada  task. 
In  this  section  we  attempt  to  overcomie  the  first  problem. 

4.3.1      An  Observation 

A  naive,  but  instructional  approach  to  removing  the  synchronization  point  re- 
quired between  task  creation  and  activation  is  to  use  tasks  in  a  dynamic  fashion. 
Consider  the  block  statement  shown  in  Figure  4.4.  The  declarative  part  con- 
tains three  type  declarations:  a  task  type,  an  array  of  ten  tasks,  and  an  access 
to  this  array.  After  these  declarations,  an  access  object,  ten_tasks,  is  declared, 
followed  by  the  body  of  worker.  Since  the  declarative  part  contains  no  task  ob- 
jects, upon  reaching  the  begin  statement,  the  sequence  of  statement  is  executed 
immediately;  the  task  executing  this  block  statement  does  not  have  to  wait  for 
any  tasks  to  be  activated. 
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declare 

task  type  worker  is  . . . 

type  loop-tasks  is  array  (1.. 10)  of  worker; 

type  refJoop-tasks  is  access  loop.tasks; 

ten.tasks  :  access_loop_tasks; 

task  body  worker  is  . . . 
begin  —  All  tasks  are  activated 

ten.tasks  :=  new  loop.tasks; 

end  —  Wcdt  for  all  tasks  to  terminate 

Figure  4.4:  A  Dyncimic  "Parallel  Loop" 

Consider  the  assignment  statement.  The  right  hand  side  of  this  statement 
contains  a  dynamic  allocation  of  an  array  of  ten  tasks.  This  is  processed  in  the 
following  manner  (ARM  9.3(6)): 

1.  The  array  is  created. 

2.  The  ten  tasks  of  the  array  are  created. 

3.  The  ten  tasks  are  activated. 

4.  The  access  value  is  zissigned  to  the  variable  ten.tasks. 

Although  the  first  two  steps  may  be  interchanged,  the  reference  manual 
clearly  states  that  the  activation  of  these  tasks  can  not  begin  until  initicdization 
of  the  object  has  completed  (ARM  9.3(6)).  In  particulzir,  all  the  tasks  must  be 
created.  These  semantics  are  consistent  with  static  tasking  semantics;  a  distinct 
synchronization  point  exists  during  the  initiation  of  a  task  object,  so  that  if 
an  exception  has  been  raised  before  that  point,  then  no  tasks  are  activated. 
For  statically  declared  tasks,  this  point  is  the  begin  statement  of  the  frame 
that  declares  the  task.    For  dynamic  tasks,  this  point  is  right  after  the  tasks 
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have  been  created.  Therefore,  this  proposed  solution  has  not  eliminated  the 
synchronization  point,  merely  relocated  it. 

Although  it  seems  that  we  have  the  same  scenario  as  in  the  c&se  of  statically 
declared  tasks,  there  is  one  key  difference:  the  semantics  of  the  allocator  specify 
that  task  activation  takes  place  immediately  after  the  tasks  have  been  created. 
When  task  objects  are  statically  declared,  this  is  not  necessarily  the  case;  if  any 
declarative  items  follow  the  task  declaration,  they  must  be  elaborated,  without 
an  exception  being  raised,  before  any  task  is  activated. 

In  our  quest  to  reduce  the  overhead  of  task  initiation,  the  knowledge  that 
activation  is  performed  once  creation  has  been  successfully  completed,  may  pro- 
vide a  source  of  optimization.  As  we  shall  see,  this  type  of  knowledge  is  used  in 
the  construction  of  the  gen.minitask  package. 

4.4      Reducing  Tasking  Context 

As  previously  discussed,  a  reason  why  the  Ada  task  is  not  suitable  for  loop-level 
parallelism  deals  with  its  granularity.  The  functionality,  and  thus  granularity, 
of  the  Ada  task  is  not  something  that  we  have  the  power  to  modify.  Therefore, 
we  search  for  other  meajis  that  a  programmer  can  inform  the  compiler  that  a 
light-weight  version  of  the  Ada  task  can  be  used. 
There  are  several  ways  this  can  be  achieved: 

1.  By  using  the  pragma  faciUty,  a  programmer  can  designate  a  task  type  as 
being  light-weight.  In  doing  so  the  implementation  can  perform  optimiza- 
tions that  reduce  the  context  that  is  required  for  objects  of  this  type. 
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2.  The  burden  of  tasking  efficiency  can  be  kept  on  the  compiler,  by  requiring 
it  to  detect  when  a  tcisk  is  used  in  a  light-weight  manner. 

3.  We  can  construct  an  Ada  idiom  that  allows  the  efficiency  of  loop-level 
parcdleHsm  to  be  reaJized  without  sacrificing  portabihty. 

4.4.1      The  Pragma  Approach 

By  designating  a  task  as  being  light-weight,  the  compiler  can  reduce  the  context 
that  must  be  maintained  for  that  task.  However,  consider  what  happens  if  that 
task  subsequently  attempts  an  operation  that  the  Hght-weight  task  does  not  sup- 
port. Assuming  that  the  compiler  does  not  attempt  to  detect  these  operations, 
an  error  will  occur.  Moreover,  when  this  program  is  run  on  a  machine  that  does 
not  support  this  pragma  the  compiler  will  simply  ignores  the  pragma  and  treats 
the  task  as  a  regular  Ada  task.  If  a  heavy-weight  operation  is  attempted  no 
error  will  occur.  Thus,  this  program  exhibits  different  behavior  depending  on 
the  system  it  is  executed  on  and  appezirs  to  be  non-portable. 

Hilfinger  remarks  that  this  hypothetical  program  is  incorrect  and  "that  the 
portability  of  incorrect  programs  has  never  been  a  serious  concern"  [HilQlj.  He 
puts  a  further  damper  on  the  "lack  of  portability"  argument  by  pointing  out  that 
the  language  maintenance  body  (ARG)  has  deemed  that  "pragmas  are  allowed 
to  cause  a  program  to  become  erroneous"  [Hil91].  Thus,  it  appears  that  this 
approach  is  a  reasonable  one  to  pursue. 
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4.4.2      The  Compiler  Detection  Approach 

For  a  compiler  to  detect  if  a  task  is  light-weight,  it  would  have  to  determine  if 
any  heavy-weight  tasking  operations  (rendezvous,  abort,  etc.)  are  performed. 
In  order  to  detect  all  occurrences  of  light-weight  tasking,  tasks  that  could  po- 
tentially perform  heavy-weight  operations  would  be  considered  in  an  attempt  to 
determine  whether  these  operations  are  actually  performed  at  execution  time. 
If  they  were  not  performed,  this  task  could  be  deemed  hght-weight. 

In  order  to  determine  if  a  task  is  going  to  perform  a  rendezvous  or  be  aborted, 
knowledge  of  the  actual  execution  is  required.  Therefore,  detecting  light-weight 
tasking,  in  general,  is  not  possible.  Despite  this  grim  news,  one  may  wish  to 
limit  the  tasks  that  are  detected  to  a  subset  of  all  light-weight  tcisks. 

By  taking  this  less  ambitious  approach,  a  compiler  may  be  able  to  detect 
some  uses  of  light-weight  tasking:  those  tasks  that  do  not  contain  any  "heavy- 
weight" tasking  operations.  Although  it  is  possible  to  statically  check  if  a  task 
type  contains  an  accept  statement  or  entry  Ccill,  determining  if  the  task  can  ever 
be  aborted  would  require  inspection  of  cJl  uses  of  the  task  type.  This  type  of 
analysis  is  not  only  undesirable,  but,  in  general,  impossible  when  the  compilation 
unit  is  a  library  package. 

Another  problem  with  this  approach,  as  is  the  case  with  parallehzing  com- 
pilers, is  that  the  programmer  is  dependent  solely  on  the  compiler  for  choosing 
the  correct  form  of  paraJlehsm,  be  it  a  regular  Ada  task  or  a  light-weight  task. 
The  programmer  may  know  that  a  particular  task  is  being  used  in  a  light-weight 
fashion,  but  since  there  is  no  means  of  communicating  this  fact  to  the  compiler, 
it  may  go  undetected.  While  this  "hcinds  off"  approach  to  parallelism  only  makes 
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correct  transformations,^  it  ignores  a  vital  asset  in  the  design  of  programs:  the 
programmer's  knowledge  of  the  program. 

For  these  reasons,  compiler  detection  of  light-weight  tcisking  is  rejected. 

4.4.3      The  Idiom  Approach 

The  method  of  choice  in  this  work  is  to  construct  an  Ada  idiom  such  that  its 
implementation  can  be  optimized  into  an  efficient  rccdization  of  loop-level  paral- 
lehsm.  Since  this  idiom  is  written  solely  in  Ada,  portability  is  mciintained.  This 
approach  is  similar  to  the  pragma  approach  in  that  the  programmer  designates 
where  loop-level  parallelism  is  being  employed.  However,  it  differs  in  that  we 
are  able  to  provide  an  increase  in  functionality,  as  well  as  an  efficient  implemen- 
tation. The  remainder  of  this  chapter  describes  this  idiom  and  the  increase  in 
functionality  that  it  provides.  Its  implementation  is  described  in  the  Chapter  5. 

4.5      Our  Idiom  -  The  Gen_Minitask  Package 

In  this  section  the  gen.minitask  generic  package  is  introduced.  This  package 
provides  its  users  with  an  efficient  means  of  obtaining  loop-level  par3dlelism 
without  modifying  the  language.  Before  specifying  this  package,  we  specify 
several  constraiints  a  parallel  loop  idiom  should  satisfy  (See  Table  4.1). 

Consider  the  first  constraint.  As  the  task  is  the  only  parallel  construct  in 
Ada,  it  will  form  the  basis  for  any  idiom  we  suggest.  In  the  course  of  presenting 
our  idiom,  we  show  that  all  five  of  these  constraints  are  satisfied. 


'provided  the  compiler  is  correct,  of  course! 
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1.  It  should  be  written  in  Ada. 

2.  It  shoiild  be  usable  in  the  same  spot  as  a  parallel  loop. 

3.  The  semantics  of  the  idiom  shovild  be  the  same  as  those  of  a  parallel  loop. 

4.  It  should  be  constructed  so  that  optimizations  can  be  performed  to  obtciin  com- 
parable efficiency  of  a  parallel  loop. 

5.  Since  it  is  written  in  Ada,  a  programmer  should  be  able  to  determine  exception 
information  about  each  "iteration"  task,  and  about  all  "iterations". 

Table  4.1:  Five  Constraints  of  a  Parallel  Loop  Idiom 


procedure  loop-body(iteration.value  :  integer)  is 
begin 
Si; 

end 

Figure  4.5:  An  Example  of  a  Loop  Body  Procedure 
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Since  our  idiom  is  to  be  used  as  a  general  means  for  expressing  loop-level 
parallelism,  it  must  be  able  to  take  parameters,  specifically  the  loop  body  and 
the  range  of  iterations  to  be  executed.  Since  the  loop  body  is  a  sequence  of 
statements,  it  can  be  represented  as  a  procedure  as  shown  in  Figure  4.5.  This 
procedure  contains  one  parameter,  the  iteration  value  of  the  loop  it  is  executing. 
The  body  of  the  procedure  contains  the  sequence  of  statements  that  v^rould  have 
appeared  as  the  statements  of  the  parallel  loop.  As  is  the  case  writh  a  parallel 
loop,  these  statements  can  reference  the  value  of  the  loop  iteration. 

To  ease  in  our  presentation,  we  note  that  the  underlying  tasks  that  execute 
a  loop  iteration  are  called  minitasks.  A  minitask  is  simply  an  Ada  tcisk  object. 
A  collection  or  array  of  minitasks  that  simulate  a  parallel  loop  is  referred  to  as 
a  minitask  family. 

We  construct  our  idiom  in  a  similar  fashion  to  the  gen_beacon  generic  package 
discussed  in  Chapter  3.  By  using  a  generic  package  as  the  basis  of  our  construct, 
we  can  group  together  the  underlying  tasks  that  execute  the  loop  iterations  with 
two  functions  and  a  boolean  variable  that  summarize  the  completion  behavior 
of  the  these  tasks. 

Figure  4.6  gives  the  gen.minitask  package  specification.  It  uses  the  gen.beacon 
package  to  distribute  task  identities  to  each  task.  We  shall  see  the  manner  in 
which  this  is  done  when  we  discuss  the  body  of  this  package.  Two  parameters 
are  passed  to  the  gen  minitask  package:  a  type  and  a  procedure.  The  type 
corresponds  to  an  integer  range  over  which  the  iterations  of  the  parallel  loop 
span.  The  procedure  parameter  corresponds  to  the  body  of  the  parallel  loop. 
It  contains  one  parameter  of  type  iteration. range.  This  parameter  is  passed  the 
iterate  this  procedure  should  execute. 
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■with  gen_beacon; 
generic 

type  iteration.range  is  range  <>; 

with  procedure  loop_body(iterate  :  in  Iteration  .range); 

package  gen.minitask  is 

type  exception-type  is  (None,  Constraint,  Numeric,  Program, 

Storage,  Tasking,  Other); 
success  :  boolean  :=  true;  —  Minitask  family  completion  status 

—  These  functions  provide  access  to  status  information. 
function  task_completion(iterate  :  in  iteration.range) 

return  boolean; 
function  task-exception(iterate  :  in  iteratlon.range) 
return  exception-type; 
end  gen.minitask; 

Figure  4.6:  The  Visible  Part  of  Gen_Minitask 

The  visible  part  of  the  package  contciins  information  that  aid  an  applica- 
tion in  determining  the  completion  status  of  both  the  minitask  family  and  each 
minitask.  Upon  completing  the  execution  of  a  minitask  family,  the  instantiating 
subprogram  can  ascertain  the  following  information: 

•  If  an  exception  was  raised  in  the  minitask  family 

•  If  an  exception  was  raised  in  a  particular  minitask 

•  What  exception  vyas  raised  for  a  particular  minitask 

This  information  is  made  available  through  a  boolean  variable,  success  and 
two  functions,  task-completion  and  task-exception.  A  typical  use  of  this  infor- 
mation would  be  for  the  instantiating  task  to  first  inspect  the  boolean  variable, 
success,  to  determine  if  any  exception  was  raised  in  the  minitask  family.  If  this 
boolean  is  true,  then  all  minitasks  completed  successfully  and  normal  processing 
can  continue. 
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However,  if  success  is  false,  the  instantiating  task  may  want  to  determine 
where  the  exception(s)  was  rciised.  To  determine  if  an  exception  was  raised  for  a 
particular  iteration,  the  function,  task. completion,  is  called.  This  function  takes 
an  iterate  ais  a  parameter,  and  returns  a  boolean  value  corresponding  to  whether 
execution  of  this  iteration  raised  an  exception.  If  an  exception  was  raised  in  this 
iteration,  task_exception  can  be  Ccilled  to  determine  the  type  of  exception  that  was 
raised.  The  value  returned  by  this  function  is  of  exception.type,  an  enumerated 
type  which  is  made  visible  when  gen.minitask  is  instantiated. 

By  providing  this  information  our  idiom  satisfies  the  fifth  constraint  of  a 
parallel  loop  idiom  (Table  4.1).  Namely,  that  a  programmer  should  be  able  to 
determine  exception  information  about  the  parallel  loop  and  each  of  its  itera- 
tions. 

4.5.1      Gen_Minitask  Usage 

Before  considering  the  body  of  the  gen.minitask  package  we  illustrate  the  manner 
in  which  this  package  can  be  used.  Figure  4.7  gives  a  typical  usage  of  the 
gen_minitask  package.  It  simulates  a  parcJlel  loop  with  one  hundred  iterations. 
The  declarative  region  can  be  broken  down  into  three  parts: 

1.  The  constrained  subtype  declaration, 

2.  The  loop  body  declaration, 

3.  The  instantiation  of  the  generic  package. 

The  first  two  declarations  provide  the  parameters  for  the  gen  minitask  instan- 
tiation. Since  one  hundred  iterations  are  desired  a  integer  subtype  constrained 
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declare 

subtype  my_range  is  integer  range  1.100; 
procedure  myJoop_body(id  :  myjange)  is 
begin 
Si; 

Sn-. 

end  myJoop-body; 

package  myjninitask  is  new  gen_minitask(myj-ange,  myJoop.body); 
begin 

if  not  myjninitask. success  then 
for  i  in  my-range  loop 

if  not  task-completion(i)  then 

—  An  exception  was  raised  in  the  i^^  iterate. 

—  We  can  use  task-exception  to  determine 

—  which  exception. 
end  if; 

end  loop; 
end  if; 
end; 

Figure  4.7:  A  Tj-pical  Usage  of  gen_minitask 
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on  the  range  of  1..100  is  declared.^  In  the  body  of  the  gen.minitask  package  this 
range  is  used  to  distribute  iterates  to  each  teisk  executing  an  iteration.  As  is  the 
case  with  all  integer  ranges,  the  lower  bound  need  not  necessarily  start  at  one. 
An  application  can  have  its  iterations  span  over  any  integer  range  by  simply 
constructing  the  appropriate  integer  range  subtype. 

The  body  of  the  parcdlel  loop  is  represented  by  the  procedure  my.loop_body. 
The  statements  would  have  appezu-ed  in  the  parallel  loop  are  placed  in  this  pro- 
cedure. The  parameter  of  the  procedure  plays  the  role  of  the  iteration  variable. 
This  procedure  can  reference  any  visible  global  variables,  thus  allowing  minitasks 
to  collaborate  on  a  computation  on  composite  data  structures  or  to  communicate 
via  shared  variables. 

After  the  subtype  and  procedure  are  declared,  the  instantiation  of  the  gen_- 
minitask  can  occur.  This  package  is  constructed  so  that  the  following  occur  upon 
instantiation  (barring  an  exception  being  raised): 

1.  All  minitasks  are  created. 

2.  All  minitasks  are  activated. 

3.  All  minitasks  receive  a  unique  identity  in  the  subtype  range. 

4.  All  minitasks  execute  the  body  of  the  procedure  parameter. 

5.  All  minitasks  terminate. 

At  the  point  where  control  is  returned  back  to  the  instantiating  tjisk,  all  minitasks 
have  completed  their  execution  allowing  the  instantiating  task  to  inspect  the 


'Since  the  upper  and  lower  bounds  of  this  r&nge  need  not  be  constant,  a  subtype  is  used  even 
though  in  this  particular  ra.<>r  a  typr  is  sufficient. 
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status  of  the  minitask  family  in  the  manner  described  previously. 

Since  the  building  blocks  of  a  minitask  instance,  the  subtype  and  the  pro- 
cedure, are  declarative  items,  a  declarative  section  must  be  entered  in  order  to 
utilize  our  idiom.  At  first  glance  this  seems  to  violate  constraint  2  for  our  par- 
allel loop  idiom:  that  it  should  be  able  to  be  used  in  the  same  spot  as  a  parallel 
loop.  However,  since  Ada  allows  a  block  statement  to  be  placed  arbitrarily  in  a 
sequence  of  statements,  this  constraint  is  indeed  satisfied. 

4.5.2      The  GenJVIinitask  Body 

In  this  section  we  present  the  body  of  gen_minitask  package.  The  manner  in 
which  this  package  is  constructed  is  important  in  our  efficient  realization  of 
loop-level  paralleHsm;  it  allows  for  an  optimized  implementation  of  an  instance 
of  this  generic  package. 

The  gen.minitask  package  as  specified  in  Figure  4.8  declares  two  types  and 
an  array  object.  As  mentioned  previously,  the  minitask  is  the  underlying  task 
that  executes  an  iteration.  The  record  type,  task_status_record,  is  used  by  the 
array  object,  task_status,  as  a  data  structure  to  store  termination  information 
about  each  minitask. 

After  this  object  declaration  we  declare  the  array  of  minitasks.  These  tasks 
begin  their  execution  once  the  begin  statement  of  the  package  body  is  reached. 
We  shall  see  later  that  this  declaration  may  be  more  effective  if  it  is  placed 
elsewhere. 

The  next  item  is  the  instantiation  of  the  gen_beacon  package  which  is  used 
to  to  distribute  task  identities  to  each  minitask  in  parallel.  We  supply  the  first 
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package  body  gen.minitask  is 
task  type  minltask; 

type  task^tatus.record  is  record 

completed  :  boolean  :=  true; 

exception_kind  :  exception-type  :=  None; 
end  record; 

—  The  data  structiire  to  hold  the  each  minitaslc's  status: 
task_status  :  Eirray(iteration_range)  of  task.status.record; 
loop.tasks  :  array(iteration_range)  of  mlnitask; 

—  Create  a  fetch.and.add  variable  for  minitask  initialization. 
package  iteration  .counter  is  new 

gen.beacon(integer(iteration.range 'first)); 

task  body  minitask  is  . .  .  (See  Figure  4.9) 

function  task.completion  is  .  . .  (See  Figure  4.11) 

function  task.exception  is  . . .  (See  Figure  4.11) 
begin 

null; 
end  gen. minitask; 

Figure  4.8:  The  First  Version  of  Package  Body  of  Gen. Minitask 
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task  body  minitask  is 

iterate  :  iteration.range; 

procedure  record-exception(exception. raised  :  exception. type)  is 
begin 

task_status(iterate). exception. kind  :—  exception.raised; 

task_status(iterate). completed  :=  false; 

success  :=  false; 
end  record.exception; 

begin 

—  Acquire  an  iterate  &  increment  counter  atomically. 
iterate  :=  iteration_range(iteration_counter.faa(l)); 
loop  J)ody(  iterate); 

exception       —  Any  imhcindled  exception  raised 
—  in  loop  .body  is  caught  here. 
when  CONSTRAINT_ERROR  =>  record.exception(Constraint); 
when  NUMERIC_ERROR  =>  record.exception(Numeric); 
when  PROGRAM_ERROR  =>  record.exception(Program); 
when  STORAGE j;rror  =>  record.exception(Storage); 
when  TASKING_ERR0R  =>  record.exception(Tasking); 
when  others  =>  record.exception(Other); 
end  minitask; 

Figure  4.9:  The  Body  of  a  Minitask 

element  in  the  iteration  .range  parameter  as  the  initial  value  to  the  underlying 
fetch.and.add  variable.  Next,  the  bodies  of  the  minitask,  and  the  two  status 
reporting  functions  are  specified. 

The  body  of  the  minitask  is  given  in  Figure  4.9.  A  local  variable  of  type 
iteration.range  is  declared  to  hold  the  iteration  value  this  ta^k  should  execute. 
The  only  other  item  in  the  declarative  part  of  this  task  is  a  local  procedure. 
This  procedure  is  used  solely  for  convenience  and  is  discussed  when  we  deal 
with  exceptions. 

Two  statements  comprise  the  sequence  of  statements  part  of  the  minitask 
body.     The  first  statement  acquires  a  unique  iterate  by  using  an  instance  of 
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gen_beacon  package,  iteration.counter.  Recall  that  this  instantiation  provides 
a  monitored  variable  onto  which  fetch.and-add  operations  can  be  performed 
efficiently  in  parallel.  As  each  minitask  calls  the  faa  procedure  with  an  increment 
of  one,  a  unique  identity  is  returned  to  it.  This  identity  is  used  as  the  iterate 
for  that  minitask;  it  is  passed  as  the  parameter  to  the  procedure  parameter  of 
the  generic  package. "* 

In  this  way  an  iteration  of  the  parallel  loop  is  executed.  As  our  implemen- 
tation of  an  instantiation  of  gen.beacon  allows  multiple  faa  ccdls  to  proceed  in 
parallel,  ail  minitasks  execute  fully  in  parallel;  no  bottleneck  is  present.  Further- 
more, the  semantics  of  a  parallel  loop  are  realized.  This  satisfies  constraint  3  of 
Table  4.1. 

Note  that  the  value  returned  from  the  faa  function  is  converted  to  the  sub- 
type iteration  range  before  it  is  assigned  to  iterate.  This  is  done  to  ensure  type 
correctness  with  the  loccil  variable  iterate  and  the  formal  parameter  of  the  pro- 
cedure parameter.  We  show  how  the  range  checking  associated  with  this  type 
conversion  can  be  eliminated  when  we  discuss  gen. minitask  implementation. 

Although  minitasks  are  not  visible  outside  of  their  package,  their  execution 
can  still  be  controlled.  The  procedure  parameter  corresponding  to  loopJ)ody 
may  access  any  visible  object,  cdlowing  minitasks  to  collaborate  on  a  computa- 
tion. Furthermore,  each  minitask  may  decide  to  abandon  its  execution  by  simply 
exiting  from  its  procedure.  In  this  manner,  one  minitask  may  signal  a  whole  fam- 
ily of  minitasks  to  abandon  their  execution  by  setting  a  boolean  variable  that  is 
visible  to  all  minitasks. 


*Note  that  the  value  tetuined  to  loop.taskt(i)  is  not  necessarily  i. 
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Handling  Exceptions  in  a  Minitask 

This  section  describes  how  exceptions  are  handled  during  the  execution  of  a  mini- 
task.  Of  primary  importance  is  where  the  exception  is  raised.  If  the  exception 
is  raised  in  the  body  of  the  minitask,  it  is  handled  by  the  appropriate  case  of  the 
exception  handler  located  in  the  minitask  itself.  This  handler  records  the  fact 
that  an  exception  was  raised  by  setting  the  success  variable  to  false  and  marking 
the  appropriate  record  element  of  the  task  status  array.  Since  all  cases  in  the 
exception  handler  are  similar,  they  combined  into  a  procedure,  record.exception. 
If  an  exception  is  raised  while  executing  the  loop.body  procedure,  this  pro- 
cedure has  the  capability  to  catch  this  exception  and  continue  processing.  Since 
this  procedure  is  provided  by  the  programmer,  it  enables  the  programmer  to 
specify  the  semantics  of  an  iteration  should  an  exception  be  raised.  Possible 
scenarios  include: 

1.  abandon  executing  this  iteration  by  catching  the  exception  and  returning 
from  the  procedure, 

2.  continue  execution  of  this  iteration  by  catching  it  and  execute  alternative 
instructions, 

3.  abandon  execution  of  this  iteration  and  signal  (via  a  shared  variable)  to 
aJl  other  minitasks  to  abandon  their  execution, 

4.  do  not  catch  the  exception  and  let  it  propagate  to  the  minitask  body. 

If  the  last  option  is  used  (the  default  when  no  exception  handler  is  specified) 
the  exception  is  caught  by  the  minitask's  exception  handler.  The  handler  records 
this  exception  in  the  same  manner  as  specified  above. 
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The  Placement  of  the  Minitask  Array  Object 

Consider  the  placement  of  the  declaration  of  the  array  of  minitasks.  At  first 
glance,  it  seems  naturcd  to  declare  this  array  in  the  declarative  part  of  the 
gen_minitask  package  body  as  was  done  in  Figure  4.8.  Under  this  scenario,  the 
tasks  begin  their  execution  once  the  begin  statement  of  the  package  body  is 
reached. 

However,  we  must  consider  when  these  minitasks  would  terminate.  The  ref- 
erence manual  specifies  that 

"the  instance  of  a  generic  package  is  a  package."  ARM  12.3(5) 

Thus,  an  instance  of  the  gen.minitask  package  is  the  same  as  declaring  this 
package  in  the  declarative  region  where  the  mstantiation  occurs.  Unlike  subpro- 
grams, packages  cannot  be  a  master  of  any  task.  The  master  of  a  task  that  is 
declared  inside  a  package  body  is  the  nearest  enclosing  task,  block  statement, 
subprogram,  or  library  package  (ARM  9.4(1)).  Since  an  instance  of  a  generic 
package  is  not  a  library  package,  if  we  declared  the  minitasks  in  the  declarative 
region  of  the  gen.minitask  package  body,  their  master  would  be  in  the  tnsianti- 
aiing  tcisk,  subprogram,  or  block  statement. 

While  this  in  itself  is  not  undesirable,  the  termination  semantics  it  implies 
are.  Consider  the  instantiation  of  the  gen.minitask  package.  At  instantiation 
time  all  minitask  will  begin  to  execute.  However,  when  can  the  instantiating  task 
be  assured  that  zdl  minitasks  have  terminated?  The  answer  is  clear,  although 
undesirable.  All  minitasks  must  terminate  upon  the  completion  of  the  sequence 
of  statements  corresponding  to  the  declarative  region  where  the  instantiation 
occurred.    Thus,  in  the  sequence  of  statement  section  no  guarantee  about  the 


CHAPTER  4.    THE  GENMINITASK  PACKAGE  86 

termination  of  these  tasks  can  be  made. 

If  we  place  the  gen_minitask  instantiation  in  its  own  block  statement,  we  can 
guarantee  that  all  minitasks  terminate  before  executing  any  other  statements. 
However,  consider  the  case  where  the  instantiating  task  wishes  to  discover  the 
completion  status  of  the  minitask  family,  a  common  occurrence.  Normally  this 
woiild  be  achieved  by  either  inspecting  the  success  variable  or  by  calUng  one  of 
the  two  status  reporting  functions.  Since  these  activities  can  not  be  performed 
until  we  are  assured  that  the  minitask  family  has  completed  its  execution,  we 
must  wait  until  after  the  end  statement  of  the  proposed  block  statement.  How- 
ever, since  the  instance  of  the  gen_minitask  occurs  in  the  declarative  region  of 
this  block,  the  boolean  variable  and  two  functions  are  no  longer  visible  once  this 
block  is  exited. 

Thus,  this  solution  is  not  effective.  By  placing  the  instantiation  inside  the 
block  statement  we  ensure  that  the  minitask  family  terminates,  but  at  the  point 
where  this  termination  takes  place  the  information  we  desire  is  no  longer  avail- 
able. 

A  solution  that  solves  this  problem  is  given  in  Figure  4.10.  It  involves  the 
placement  of  the  array  declaration  of  minitasks.  By  placing  this  zirray  object  in 
a  block  statement  inside  of  the  genjninitask  package  we  achieve  the  best  of  both 
worlds;  we  Ccm  guarantee  to  an  instantiating  task  that  all  minitasks  will  terminate 
at  a  point  where  visibihty  to  the  boolean  and  two  functions  is  maintained. 

By  placing  the  array  object  inside  a  block  statement  we  have  effectively 
changed  the  master  of  these  tasks  from  the  instantiating  task  to  the  block  state- 
ment itself.  Thus,  this  block  cannot  be  exited  until  till  minitasks  have  terminated. 
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package  body  gen_minitask  Is 
task  type  minitask; 

type  task_status_record  is  record 

completed  :  boolean  :=  true; 

exception _kind  :  exception.type  :—  None; 
end  record; 

—  The  data  structure  to  hold  the  each  minitask's  status: 
task.status  :  array(iteration_range)  of  task_status.record; 

Create  a  fetch. and_add  variable  for  minitask  initialization. 

package  iteration_counter  is  new 

gen-beacon(integer(iteration_range'first)); 

task  body  minitask  is  . . .  (See  Figure  4.9) 

function  task.completion  is  . . .  (See  Figure  4.11) 
function  task.exception  is  .  . .  (See  Figure  4.11) 

begin 

declare  —  This  block  is  the  minitasks' master. 

loop_tasks  :  array(iteration  j-ange)  of  minitask; 
begin 

null; 
end; —  All  minitasks  terminate  before  this  block  is  exited. 
end  gen.minitask; 

Figure  4.10:  The  Improved  Version  of  Package  Body  of  genjninitask 
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function  task_completion(iterate  :  in  iterationj-ange)  return  boolean  is 
begin 

ret  urn(task_status(iterate). completed); 
end  task-completion; 

function  task.exception(iterate  :  in  iteration.range)  return  exception.type  is 
begin 

ret  urn(  t  ask  _st  a  tus(  iterate). except  ion  .kind); 
end  task_exception; 

Figure  4.11:  The  Task. Completion  and  Task.Exception  Functions 

In  turn,  the  sequence  of  statements  of  the  gen.minitask  body  cannot  be  completed 
until  this  block  statement  is  completed.  Therefore,  control  is  not  returned  to  the 
instantiating  task  until  all  minitasks  have  terminated.  In  addition,  the  sequence 
of  statements  that  corresponds  to  the  declarative  region  that  instantiated  the 
gen_minitask  package  retains  visibility  to  the  completion  status  information. 

4.5.3      The  Completion  Status  Routines 

In  addition  to  creating  the  appropriate  number  of  paraJlel  threads,  an  instance 
of  the  gen.minitask  package  provides  a  boolean  variable  and  two  functions  to 
help  determine  the  status  of  a  minitask  family.  We  have  already  seen  how  the 
success  variable  is  manipulated.  In  Figure  4.11,  we  specify  the  two  functions 
task.completion  and  task.exception.  We  shall  show  in  Chapter  5  how  a  call  to 
these  functions  corresponds  to  a  straightforward  access  to  the  task-Status  data 
structure  to  determine  the  behavior  of  a  particular  iterate. 


Chapter  5 


GenMinitask  Implementation 


This  chapter  describes  the  manner  in  which  an  instantiation  of  the  gen_minitask 
generic  package  is  implemented.  Our  goal  in  this  implementation  is  to  minimize 
common  initialization,  and  to  eliminate,  if  possible,  all  serial  operations  in  recJ- 
izing  a  "parallel  loop"  idiom.  Ada  semantics  force  the  seriedization  of  a  number 
of  operations,  some  of  which  are  not  obvious.  Through  the  use  of  several  opti- 
mizations, we  attempt  to  either  remove  these  operations  edl  together,  or  perform 
them  in  parallel,  while  still  upholding  Ada  semantics. 

The  implementation  we  describe  is  directly  related  to  the  usefulness  of  our 
idiom  in  obtcuning  efficient  loop-level  paradlehsm.  To  obtciin  this  efficiency  we 
make  use  of  several  optimizations.  Underlying  these  optimizations  are  two  prin- 
ciples: 

1.  As  is  the  csise  with  the  gen.beacon  generic,  the  contents  of  the  gen_minitask 
generic  arc  known  at  compiler  writing  time.  This  information  is  synthe- 
sized into  our  implementation  to  allow  for  an  efficient  implementation. 
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2.  The  entities  in  the  package  body  are  not  accessible  outside  of  its  body. 
Therefore,  the  manner  in  which  these  entities  are  used  is  also  known 
at  compiler  writing  time.  Of  particular  interest,  minitasks  are  used  in 
a  light-weight  manner;  they  do  not  contciin  any  entry  points  and  cannot 
be  aborted  directly. 

To  determine  the  manner  in  which  an  instantiation  of  the  gen_minitask  generic 
package  is  elaborated,  we  consult  the  appropriate  section  in  the  reference  man- 
ual: 

"For  the  elaboration  of  a  generic  instantiation,  each  expression  sup- 
plied as  an  explicit  generic  actuzil  parameter  is  first  evaluated.    . . . 
Finally,  the  implicitly  generated  instance  is  elaborated." 
ARM    12.3(17) 

Since  the  parameters  of  our  generic  are  a  type  and  a  procedure,  no  expressions 
need  be  evaluated.^  An  elaboration  of  an  instantiation  corresponds  to  elaborat- 
ing the  gen.minitask  generic  package  with  the  actual  parameters  substituted  for 
the  formals.  This  elaboration  is  broken  into  three  phases:  the  elaboration  of  the 
package  specification,  the  elaboration  of  the  package  body,  and  the  execution  of 
the  sequence  of  statements  associated  with  the  package  body.  We  describe  each 
of  these  phases  in  the  following  three  sections. 


'At  compile  time,  checks  are  performed  to  ensure  that  the  actual  parameters  satisfy  the 
constraints  given  by  the  formals. 
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package  gen.minitask  is 

type  exception.type  is  (None,  Constraint,  Numeric,  Program, 

Storage,  Tasking,  Other); 

success  :  boolean  :—  true; 

function  task.completion(iterate  ;  in  iteration-range)  return  boolean; 

function  task_exception(iterate  ;  in  iterationj-ange)  return  exception.type; 
end  gen jninitask; 

Figure  5.1:  The  Package  Specification  of  Gen.Minitask 

5.1      Elaborating  the  Package  Specification 

Consider  the  gen.minitask  generic  package  specification  as  shown  in  Figure  5.1. 
Since  the  items  declcired  in  this  specification  are  visible  to  an  instantiating  block, 
they  are  be  treated  as  any  other  declarative  item:  the  optimizations  we  can 
perform  are  limited.  Therefore,  we  make  these  items  available  in  the  standard 
way  other  library  packages  (Calendar  emd  System,  for  example)  are  processed:  the 
compiler  is  initiadized  with  the  information  contained  in  these  packages  when  it  is 
created.  Any  program  that  instcintiates  the  gen_minitask  generic  gaiins  visibiUty 
to  the  appropriate  entities  declzured  in  the  specification  of  this  package. 

For  completeness,  we  review  the  semantics  of  the  elaboration  of  the  package 
specification.  We  shall  see  that  although  we  aie  Hmited  in  the  optimizations  that 
we  can  perform,  the  nin  time  overhead  involved  with  elaborating  these  items  is 
n\ininial. 

type  exception.type  is  . , . 

The  elaboration  of  this  eninnerated  tj-pe  is  specified  by  the  following: 

"The  elaboration  of  an  enumeration  type  definition  creates  an 
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enumeration  type;  this  elaboration  includes  that  of  every  enu- 
meration literal  specification."  ARM  3.5.1(3) 

success  :   boolean  :=  true; 

The  elaboration  of  this  object  declaration  involves  the  follovdng  four  steps: 

1.  Elaborate  the  type,  boolean. 

2.  Evaluate  the  initial  value,  true. 

3.  Create  the  object,  success. 

4.  Assign  the  initial  value,  true,  to  success. 

Since  the  type  and  initial  value  are  predefined,  the  first  two  steps  do  not 
require  any  action.  However,  the  last  two  steps,  creating  the  object  and 
assigning  it  an  initial  value  do  need  to  be  performed  at  run  time. 

function  task_completion  ... 

function  task_exception  . . . 

These  two  function  declarations  are  elaborated  in  the  same  manner: 

"The  elaboration  of  a  subprogram  declaration  elaborates  the 
corresponding  formal  part.  The  elaboration  of  a  formal  part  has 
no  other  eflfect."  ARM  6.1(6) 

These  statements  specify  that  no  action  is  required  in  elaborating  these 
two  subprograjn  declarations  except  recording  the  fact  that  they  haVe  been 
elaborated. 
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In  summary,  elaborating  the  package  specification  requires  the  designating  of 
three  entities  (exception_type,  task_completion,  and  task_exception)  as  elaborated 
in  addition  to  aJlocating  a  boolean  variable  with  an  initial  value  of  true.  All  four 
of  these  entities  are  visible  to  the  instantiating  block. 

5.2      Elaborating  the  Declarative  Part 

Once  the  package  specification  is  elaborated,  we  can  begin  elaborating  the  declar- 
ative part  of  the  package  body.  In  contract  to  the  package  specification  the 
declarative  entities  in  the  package  body  are  not  visible  to  the  programmer;  they 
cannot  be  used  in  any  program  specified  by  a  user.  Hence,  our  implementation  of 
these  entities  does  not  necessarily  have  to  process  these  items  in  the  same  man- 
ner as  it  would  if  they  were  visible.  However,  the  elaboration  and  subsequent 
use  of  these  entities  must  still  uphold  proper  Ada  semantics. 

Figure  5.2  specifies  the  declarative  part  of  the  gen_minitask  generic  package 
body.  In  the  remainder  of  this  section,  we  consider  each  item,  specifying  the 
semantics  of  its  elaboration. 

task  type  minitask; 

A  task  type  is  an  example  of  a  task  specification, 

"For  the  elaboration  of  a  task  specification,  entry  declarations 
and  representation  clauses,  if  any,  are  elaborated  in  the  order 
given."  ARM    9.1(5) 

Sincr  th<-  minitask  specification  does  not  contain  any  entry  or  representa- 
tion clauses,  this  elaboration  requires  no  action. 
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package  body  gen_minltask  is 
task  type  minitask; 
type  task_status_record  is  record 
completed  :  boolean  :=  true; 
exception  .kind  :  exception.type  :=  None; 
end  record; 

task_status  :  array(iteration_range)  of  task.status_record; 
package  iteration_counter  is  new 

gen  _beacon(tnteger(  iteration  .range 'first)); 
task  body  minitask  is  .  . . 
function  task. completion  is  . . . 
function  task.exception  is  . . . 
begin 

end  gen.minitask; 
Figure  5.2:  The  Declarative  Part  of  Gen. Minitask  Body 

type  task_status_record  is  record  ... 

task_status_record  is  a  record  type  definition.  The  semantics  of  its  elabo- 
ration are  given  by  the  following: 

"The  elaboration  of  a  record  type  definition  creates  a  record 
type;  it  consists  of  the  elaboration  of  any  corresponding  (sin- 
gle) component  declarations,  in  the  order  in  which  they  appear, 
including  any  component  declaration  in  a  variant  part."  ARM 
3.7(7) 

This  quote  specifies  that  elaborating  this  record  type  creates  that  type. 
Since  this  type  can  only  be  used  within  this  package,  we  do  not  need  to 
actually  create  it.  However,  when  we  subsequently  use  this  type  we  do 
need  to  recognize  it  as  a  created  type  with  the  components  and  default 
vaJues  that  are  specified. 
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As  is  the  case  with  other  types,  a  type  template  is  associated  with  the  type 
task-status.record.  Since  this  type  is  static,  its  type  template  is  generated 
at  compile  time.  This  template  is  used  when  an  object  of  this  type  is 
created  or  accessed. 

task_status  :  array(iteration_range)  of  task_status_record; 

This  item  is  an  object  declaration  that  utilizes  the  previous  type  definition. 

"The  elaboration  of  an  object  declaration  proceeds  as  follows: 

1.  The  subtype  indication  or  constrained  array  is  first  elabo- 
rated. This  establishes  the  subtype  of  the  object. 

2.  If  the  object  declaration  includes  an  explicit  initialization, 
the  initial  value  is  obtained  by  evaluating  the  correspond- 
ing expression.  Otherwise  any  impHcit  initial  values  for  the 
object  or  for  its  subcomponents  are  evaluated. 

3.  The  object  is  created. 

4.  Any  initial  value  is  assigned  to  the  object  or  to  the  corre- 
sponding subcomponent."  ARM  3.2.1(4-8) 

The  first  step  specifies  that  the  array  is  first  elaborated.  To  determine  the 
manner  in  which  this  is  done,  we  consider  the  following: 

"The  elaboration  of  a  constrzuned  array  definition  creates  the 
corresponding  array  type  and  array  subtype.  For  this  elabora- 
tion, the  index  constrjiint  and  the  component  subtype  indication 
arc  elaborated."  ARM  3.6(10) 
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From  this  section  we  see  that  this  elaboration  corresponds  to  creating  the 
array  type,  and  elaborating  the  index  constraint  and  the  array  subtype. 
As  is  the  case  with  other  types,  actually  creating  the  type  can  be  omitted; 
this  type  is  not  to  be  used  cinywhere  else. 

Since  the  index  constraint  corresponds  to  the  type  parameter  to  the  generic, 
it  already  has  been  evaluated  in  order  to  satisfy  the  parameter  matching 
rules.  Similarly,  the  component  subtype,  task_status. record,  has  also  been 
elaborated.  Thus,  the  first  part  of  the  elaboration  of  the  array  object  does 
not  require  any  action. 

Turning  to  the  second  step,  we  note  that  no  explicit  initialization  of  the 
array  is  specified.  However,  the  record  type  does  specify  an  initial  value 
for  both  of  its  subcomponents.  Since  these  values  are  known  constants 
their  run  time  evaluation  is  not  required. 

The  last  two  steps  specified  for  the  elaboration  of  the  array  object  re- 
quire run  time  action.  The  object  must  be  created  and  its  subcomponents 
assigned  their  initial  values.  To  comply  with  this  requirement,  our  im- 
plementation allocates  a  block  of  storage  of  2N  units  in  shared  memory, 
where  N  —  task.range'last  -  task.range'first  +  1.  It  should  be  noted  that  we 
could  pack  the  boolean  and  the  enumerated  type  into  four  bits,  allowing 
for  a  total  allocation  of  4A'^  bits. 

Once  this  storage  has  been  allocated,  it  would  normally  be  initialized; 
each  of  the  completed  fields  would  be  set  to  true,  while  each  of  the  excep- 
tion_kind  fields  would  be  set  to  None.  This  initialization  would  complete 
the  elaboration  of  this  object  declaration. 
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However,  although  the  time  to  perform  each  initialization  is  small,  N  ini- 
tializations are  required.  As  it  is  our  goal  to  reduce  run  time  overhead, 
an  action  that  is  dependent  on  the  size  of  TV  is  undesirable.  We  avoid  this 
serialization  point  by  postponing  the  initialization  until  the  correspond- 
ing niinitask  is  created.  Upon  being  created  each  minitask  initializes  its 
element  in  the  task_status  array.  These  initicJizations  are  performed  in 
parallel. 

package  iteration_counter  is  new  gen_beacon(. .  .) 

This  item  is  an  instantiation  of  the  gen_beacon  package.  Elaborating  this 
instantiation  involves  extracting  the  'first  value  from  the  iteration.range 
type,  converting  it  to  an  integer,  and  using  it  for  the  gen  Jjeacon  instanti- 
ation. As  described  in  Chapter  3,  instantiating  this  package  corresponds 
to  allocating  an  integer  variable  and  assigning  it  the  initial  value  that  is 
passed  as  its  parameter.  It  also  makes  visible  the  three  monitoring  sub- 
programs: read,  write,  and  faa.  Section  5.5  describes  how  all  these  actions 
are  performed  implicitly  and  therefore,  have  zero  run  time  cost. 

task  body  minitask  is  . . . 

The  elaboration  of  a  task  body  "has  no  other  effect  than  to  establish  that 
the  body  can  from  then  on  be  used  for  the  execution  of  tasks  designated 
by  objects  of  the  corresponding  taisk  type"  (ARM  9.1(5)). 

This  can  be  accomplished  by  setting  a  "callable"  boolean  variable  associ- 
ated with  this  task  to  true.  However,  by  inspecting  the  the  gen  minitask 
package,  we  can  conclude  that  the  minitask  body  is  always  elaborated  be- 
fore any  of  its  uses  arc  encountered.  Therefore,  no  boolean  variable  or  run 
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time  action  is  required. 

function  task_completion  is  . . . 

function  task_exception  is  . . . 

Subprogram  bodies  are  elaborated  in  the  same  manner  as  task  body  as 
described  in  the  following: 

"The  elaboration  of  a  subprogram  body  has  no  other  effect  than 
to  establish  that  the  body  can  from  then  on  be  used  for  the 
execution  of  ceills  of  the  subprogram."  ARM  6.3(5) 

As  in  the  case  with  the  minitask  body,  a  "callable"  boolean  variable  can  be 
used  to  mark  the  elaboration  of  a  function  body.  However,  our  implemen- 
tation need  not  maintain  this  variable;  these  functions  are  not  used  within 
the  gen_minitask  generic  and  any  use  by  the  instantiating  task  must  follow 
the  instantiation  itself.  Since  these  function  bodies  are  elaborated  during 
this  instantiation,  no  boolean  variable  or  run  time  action  is  needed. 

Table  5.1  summarizes  the  actions  that  we  perform  for  the  elaboration  of  the 
each  item  in  the  declarative  part  of  the  genjtiinitask  package. 

5.3      Executing  the  Sequence  of  Statements 

After  the  elaboration  of  the  package  specification  and  the  declarative  part  of  the 
package  body,  the  sequence  of  statements  associated  with  the  package  body  is 
executed.   Once  these  statements  have  been  executed,  control  is  returned  back 
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Declarative  Item 

Visible? 

Implementation  Action! 

type  exception.type  is  . . . 

Yes 

No  Action  Required. 

success  :  boolean  :=  true; 

Yes 

Allocate  a  boolecin  variable. 
Initialize  it  to  true. 

function  task  .completion  ... 

Yes 

No  action  required. 

function  task.exception  . . . 

Yes 

No  action  required. 

task  type  minitask; 

No 

No  action  required. 

type  task^statusj-ecord  is  . . . 

No 

No  action  required. 

task^status  :  array(iterationj'ange) 
of  task.statusj-ecord; 

No 

Allocate  27V  storage  iinits. 

package  iteration-counter  is  . . . 

No 

Allocate  one  storage  unit. J 
Set  it  to  iteration j-ange'first. 
Make  read,  write,  and  faa  visible. 

task  body  minitask  is  . . . 

No 

No  action  required. 

function  task.completion  is  . . . 

No 

No  action  required. 

function  task.exception  is  ... 

No 

No  action  required. 

fWhen  the  compiler  is  created,  it  is  initicJized  with  the  first  four  items. 
JSection  5.5  describes  a  method  that  eliminates  these  actions. 


Table  5.1:  The  Elaboration  of  the  Declarative  Part  of  the  Gen. Minitask 
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begin 

declare 

loop.tasks  :  array(iteration-range)  of  minitask; 
begin 

null; 
end; 
end  gen_minitask; 

Figure  5.3:  The  Sequence  of  Statements  of  the  Gen_Minitask  Body 

to  the  instantiating  block.    Figure  5.3  specifies  the  sequence  of  statements  for 
the  body  of  the  gen_minitask  package. 

As  mentioned  in  Section  4.5.2,  a  block  statement  is  the  only  statement  con- 
tained in  the  sequence  of  statements  section  of  the  genjninitask  body.  A  block 
statement  is  executed  in  a  straightforward  manner: 

"The  execution  of  a  block  statement  consists  of  the  elaboration  of  its 
declarative  part  (if  axiy)  followed  by  the  execution  of  the  sequence  of 
statements."  ARM  5.6(4) 

The  declarative  part  of  the  block  statement  contciins  only  one  declarative 
item,  an  array  of  minitasks.  When  this  item  is  elaborated,  the  array  and  the 
appropriate  number  of  tasks  are  created.  Each  element  of  the  array  is  assigned 
an  implicit  initial  value  designating  the  corresponding  task  (ARM  3.2.1(11)). 

After  the  array  has  been  successfully  elaborated,  the  elaboration  of  declara- 
tive part  is  complete.  Since  the  declarative  part  contains  taisk  objects  that  were 
created,  they  are  activated  upon  reaching  the  begin  statement.  From  then  on  all 
minitasks  execute  in  pcirallel.  Before  exiting  the  sequence  of  statements  section 
of  the  block  statement,  the  task  that  executed  this  block,  the  instzintiating  task, 
must  wait  for  all  minitasks  to  terminate.    At  this  point,  control  is  returned  to 


CHAPTER  5.    GENJdINITASK  IMPLEMENTATION  101 

the  instantiating  declarative  part. 

Now  that  we  have  specified  the  semantics  of  the  block  statement,  we  consider 
how  to  obtain  an  efficient  implementation  that  satisfies  these  semantics. 

In  Section  4.2,  two  deficiencies  in  the  Ada  tasking  model  were  specified  that 
prevent  the  realization  of  efficient  loop-level  paraileHsm.  In  summary,  they  are: 

•  The  synchronization  point  required  between  task  creation  and  task  acti- 
vation. 

•  The  storage  overhead  used  to  maintain  a  task's  status. 

The  body  of  the  genjninitask  package  is  constructed  in  such  a  way  that 
optimizations  may  be  performed  to  overcome  these  two  deficiencies.  Note  the 
following  properties  about  the  block  statement: 

1.  Once  all  minitasks  have  been  created,  they  are  immediately  activated;  there 
are  no  declarative  items  between  the  creation  of  these  tasks  and  their 
activation. 

2.  Minitasks  do  not  possess  any  entry  points,  nor  can  they  be  aborted  directly. 
In  addition,  no  priority  is  specified  for  the  minitask. 

In  the  next  two  sections,  we  considered  the  aforementioned  deficiencies.  We 
siiow  how  the  properties  mentioned  in  the  previous  paragraph  are  instrumental 
in  removing  these  obstacles  (as  described  in  Section  4.2)  to  efficient  loop-level 
parallelism. 
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5.4      Removing  the  Synchronization  Point 

This  section  describes  how  the  two  phase  process  of  task  creation  and  task  acti- 
vation can  be  combined  into  one  phase.  The  optimization  we  describe  removes 
the  synchronization  point  between  the  time  when  a  task  is  created  and  when  it 
is  activated.  This  optimization  uses  the  first  property,  that  the  behavior  of  the 
declarative  part  that  contziins  the  task  object  is  known. 

Recall  the  origin  of  this  synchronization  point  (see  Section  4.2):  a  statically 
declared  task  object  is  not  activated  until  all  declarative  items  that  foUow  it  in 
the  same  declarative  part  have  been  elaborated  successfully.  If  the  elaboration  of 
any  of  these  items  raises  an  exception,  then  the  task  is  not  activated.  We  provide 
an  implementation  of  the  block  statement  that  removes  the  synchronization 
point,  while  still  satisfying  proper  Ada  semantics.  We  begin  with  two  lemmas. 

Lemma  5.1  STORAGEJERROR  is  the  only  exception  that  can  be  raised  during 
the  elaboration  of  loop.tasks. 

Proof  The  elaboration  of  an  object  declaration  proceeds  in  the  four  steps 
described  in  Table  5.2.  We  consider  each  step,  in  turn,  examing  what  exceptions 
might  be  raised. 

Step  1:  Since  the  object  declaration  we  are  considering  is  a  constrained  array, 
we  consider  the  manner  in  which  this  entity  is  elaborated.  Section  3.6(10) 
of  the  reference  manual  specifies  that  in  addition  to  creating  the  array 
type  and  array  subtype,  the  index  constraint  and  the  component  subtype 
indication  are  elaborated. 
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The  elaboration  of  an  index  constraint  consists  of  evaluating  the  discrete 
range  which  can  potentially  raise  NUMERIC_ERROR  (arm  3.6.1(11)).  How- 
ever, since  the  range  itself  is  a  type  definition  that  has  already  been  suc- 
cessfully elaborated  as  a  generic  parameter,  we  are  assured  that  when 
control  reaches  this  point  of  the  text,  this  evzJuation  does  not  raiise  NU- 
MERIC_ERROR.  Elaborating  the  component  subtype,  minitask,  creates  that 
subtype. 

Summarizing  this  step,  we  see  that  the  only  exception  that  can  be  reused 
is  STORAGE_ERROR.  Due  to  insufficient  storage  for  the  creation  of  any  of 
the  objects  mentioned.  (ARM    11.1(8)). 

Step  2:  Since  no  explicit  initialization  is  specified  (or  allowed)  for  the  loop.tasks 
declaration,  the  first  part  of  this  step  does  not  apply.  The  implicit  initial 
Vcdue  for  a  task  designates  a  task.  Since  storage  is  allocated  when  a  task 
is  created,  STORAGE_ERROR  may  be  raised  in  this  step. 

Step  3:  The  creation  of  the  array  object,  loop.tasks,  can  also  raise  STORAGE.- 
ERROR.  No  other  exception  caji  be  raiised  by  this  creation. 

Step  4:  Assigning  the  initial  value  of  each  task  to  the  appropriate  components 
of  the  array  cannot  raise  any  exception. 

In  each  of  the  four  steps,  no  other  exception  besides  STORAGE-ERROR  can 
be  rciiscd.  Therefore,  STORAGEJERROR  is  the  only  exception  that  can  be  raised 
during  the  elaboration  of  the  loop.tasks.  ^ 
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1.  "The  subtype  indication  or  the  constrained  array  definition  is  first  elaborated." 

2.  "If  the  object  declaration  includes  an  explicit  initiahzation,  the  initial  Vcilue  is 
obtained  by  evaluating  the  corresponding  expression.  Otherwise  any  implicit 
initial  vzJues  for  the  object  or  for  its  subcomponents  are  evaduated." 

3.  "The  object  is  created." 

4.  "Any  initial  value  is  assigned  to  the  object." 

Table  5.2:  The  Four  Steps  Involved  in  Elaborating  an  Object  Declaration 

Lemma  5.2  //  no  exception  is  raised  during  the  elaboration  of  loop.tasks  in 
Figure  5.S,  then  all  minitask^  can  be  activated  immediately  following  their  cre- 
ation. 

Proof  Assume  no  exception  is  raised  during  the  elaboration  of  loop.tasks. 
During  this  elaboration  all  minitasks  are  created.  As  there  are  no  other  items 
between  this  declaration  and  the  begin  statement,  no  exception  can  be  raised 
after  this  item  is  elaborated.  By  (ARM  9.3(2)),  once  the  declarative  items  of 
the  block  statement  have  been  elaborated,  the  activation  of  all  tasks  declared 
in  the  corresponding  declarative  part  coramences.  Hence,  once  all  minitasks  are 
created,  they  can  be  activated.  ^ 

These  two  lemmas  ahovi  us  to  prove  the  following  theorem.  This  theorem 
forms  the  basis  for  our  implementation  of  the  block  statement  and  allows  us  to 
eliminate  the  synchronization  point. 

Theorem  5.1  If  STORAGE_ERR0R  is  not  raised  during  the  elaboration  of  loop_- 
tasks,  then  the  creation  and  activation  of  minitasks  can  be  combined,  and  per- 
formed in  parallel. 
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Proof  Assume  STORAGE_ERROR  is  not  raised  during  the  elaboration  of  loop.- 
tasks.  By  Lemma  5.1,  we  conclude  that  no  exception  is  raised  during  this  elab- 
oration. By  lemma  5.2,  cdl  minitasks  are  activated  immediately  following  this 
elaboration.  Thus,  there  is  no  need  for  a  synchronization  point  between  the 
creation  and  activation  of  the  minitasks;  this  activation  can  be  combined  with 
the  elaboration  of  loop.tasks.  ^ 

Before  we  discuss  how  our  implementation  utilizes  this  theorem,  we  note  the 
existence  of  a  second  synchronization  point  associated  with  task  initiation.  This 
point  occurs  when  the  last  task  that  has  completed  its  activation  unblocks  its 
parent  task.  At  this  point  the  parent  can  begin  executing  the  statements  that 
follow  the  begin  statement  that  prompted  this  activation. 

In  our  implementation  of  the  gen.minitask  idiom,  we  are  able  to  ignore  this 
synchronization  point.  The  only  statement  after  the  inner  begin  statement  of 
Figure  5.3  is  a  null  statement.  Executing  a  null  statement  has  no  effect  (aRM 
5.1(5)),  so  the  parent  task  is  blocked  until  all  of  its  dependent  tasks  have  termi- 
nated. Thus,  our  implementation  does  not  awaken  the  parent  task  until  all  of 
its  dependent  tasks  have  terminated.  Although  a  synchronization  point  is  not 
implemented,  we  nevertheless  uphold  proper  Ada  semantics. 

We  now  consider  how  to  remove  the  synchronization  point  referred  to  in 
Theorem  5.1.  This  theorem  states  that  if  we  can  determine  that  the  elaboration 
of  loop.tasks  does  not  raise  STORAGE-ERROR,  then  task  creation  and  activation 
can  be  combined.  While  one  cannot  guarantee  that  STORAGE-ERROR  will  never 
be  raiised,  we  can  tailor  our  implementation  to  teikc  advantage  of  this  theorem; 
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we  attempt  to  ascerttiin  if  an  elaboration  of  loop.tasks  is  going  to  raise  STOR- 
AGE_ERROR  before  we  begin  the  elaboration. 

Consider  how  STORAGE-ERROR  could  be  raised  during  the  elaboration  of 
loop.tasks.  As  was  noted  in  the  proof  of  lemma  5.1,  storage  is  requested  at  several 
places  throughout  this  elaboration;  storage  is  also  allocated  for  the  array  and  to 
manage  the  new  task's  status  (the  task  control  block)  and  to  store  values  (its 
stack).  In  our  implementation,  these  requests  can  combined.  If  this  combined 
request  fails,  STORAGE-ERROR  is  raised.  However,  we  would  like  to  determine 
this  before  we  begin  the  elaboration. 

If  we  implement  this  elaboration  in  a  straightforward  manner,  we  will  not  be 
able  to  determine  if  STORAGE_ERROR  is  raised  until  after  all  the  storage  requests 
are  made,  i.e.  after  task  creation  is  performed.  Thus,  under  this  straightforward 
implementation  a  synchronization  point  exists;  task  creation  cannot  be  combined 
with  task  activation.  The  next  two  sections  describe  methods  in  which  the 
synchronization  point  can  be  removed. 

5.4.1      One-Shot  Allocation  of  Stacks 

One  implementation  approach  is  have  the  parent  task  attempt  to  allocate  a 
large  chunk  of  storage  for  all  minitasks.  This  approach  takes  advantage  of  the 
fact  that  the  amount  of  storage  needed  for  a  minitask  instantiation  is  a  function 
of  the  number  of  minitasks  that  are  to  be  created.  Since  this  number  is  known 
before  the  elaboration  of  loop.tasks,  the  total  amount  of  storage  to  be  requested 
by  this  elaboration  is  known.  Therefore,  this  storage  cam  be  requested  by  the 
instantiating  task  before  the  elaboration  of  loop.tasks  commences. 
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If  this  storage  request  is  successful,  we  know  that  STORAGE_ERROR  will  not 
be  raised  for  this  elaboration.  By  applying  Theorem  5.1,  we  can  combine  ta^k 
creation  with  task  activation.  Each  minitask  could  then  use  its  identity  to  index 
into  the  pool  of  storage  to  obtain  its  own  section. 

Although  this  approach  does  remove  the  synchronization  point  associated 
with  task  initiation,  there  is  one  major  drawback:  since  the  processor  that 
allocates  the  memory  is  not  necessarily  the  same  one  that  uses  it,  this  chunk  of 
storage  must  be  allocated  in  shared  memory.  Consequently,  all  stack  operations 
that  a  task  undertakes  would  need  to  suffer  the  performance  penalty  of  a  shared 
memory  access.  Since  accessing  shared  memory  on  a  machine  such  as  the  rp3 
can  take  up  to  ten  times  as  long  as  an  access  to  local  memory,  this  approach  is 
rejected. 

5.4.2      Stack  Recycling 

In  this  section,  we  suggest  an  alternative  stack  allocation  scheme,  called  stack 
recycling,  which  satisfies  the  condition  needed  to  apply  Theorem  5.1. 

Under  the  stack  recycling  scheme,  each  processor  has  associated  with  it  a 
stack  which  is  used  exclusively  for  the  execution  of  minitasks.^  This  "minitask 
stack"  is  allocated  by  a  processor  as  part  of  that  processor's  initial  execution, 
before  it  attempts  to  execute  any  tzisks.  Once  this  allocation  has  been  performed, 
a  processor  can  execute  a  minitask  using  its  locaJ  minitask  stack.  When  the 
minitask  terminates,  this  stack  is  recycled  by  using  it  for  the  next  minitask  that 
is  executed  by  this  processor. 


'Recall  that  a  "stack"  ulso  includes  a  tcb. 
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Since  all  stack  allocations  are  made  before  a  program  begins  its  execution, 
there  is  no  need  to  request  storage  for  this  purpose  during  the  instantiation 
of  the  gen-minitask  package.  AU  other  storage  requests  are  performed  by  the 
instantiating  task  before  any  minitasks  are  enqueued.^  Once  these  requests  have 
been  satisfied,  we  are  assured  that  STORAGE-ERROR  will  not  be  raiised  during 
the  elaboration  of  loop.tasks.  Applying  Theorem  5.1,  we  combine  tcisk  creation 
and  task  activation. 

In  addition  to  allowing  the  combination  of  task  creation  and  activation,  this 
scheme  also  offers  the  following  advantages: 

•  All  minitask  stacks  reside  in  the  local  memory  of  the  processor  that  is 
executing  it.  Hence,  all  stack  and  tcb  operations  are  local  accesses. 

•  The  total  number  of  stacks  used  for  the  execution  of  minitasks  in  an  ap- 
plication is  bounded  by  the  number  of  processors,  not  by  the  number  of 
minitasks. 

5.4.3      A  Critique  of  the  Stack  Recycling  Scheme 

In  the  previous  section  we  discussed  some  of  the  advantages  of  using  the  stack 
recycling  scheme.  In  this  section  we  critique  this  scheme,  discussing  the  potential 
disadvantages.  We  shall  show  that  these  disadvantages  are  either  insignificant 
or  can  be  overconae  by  enhancing  our  scheme.  The  two  disadvajitages  we  discuss 
both  rise  from  the  fact  that  minitask  stacks  are  allocated  on  a  per  processor  basis 
rather  than  one  per  minitask. 


^In  our  implementation,  the  initialization  of  the  task^tatus  airay  is  performed  by  each  mini- 
task  avoiding  a  potential  bottleneck. 
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The  first  potential  disadvantage  of  this  approach  is  that  if  a  minitask  blocks 
its  execution,  the  processor  it  is  executing  on  is  unable  to  execute  other  minitasks; 
the  minitask  stack  for  that  processor  is  already  in  use.  To  better  xinderstand  the 
severity  of  this  hmitation,  we  consider  the  circumstances  that  would  cause  a 
minitask  to  block. 

Consider  the  case  where  an  instantiation  of  the  gen_minitask  package  is  used 
strictly  as  a  parallel  loop  construct;  the  loop  body  contains  a  simple  sequence 
of  statements.  Used  in  this  fashion,  the  only  reason  for  a  minitask  to  block  is  if 
it  performs  l/O,''  an  event  that  we  expect  to  happen  infrequently. 

However,  since  the  body  of  a  minitask  is  specified  by  a  procedure  (loop.body), 
it  may  contain  other  instances  of  the  gen.minitask  generic  package  (see  Fig- 
ure 5.4).  As  a  consequence  of  this,  a  minitask  may  need  to  block  its  execution 
while  it  awaits  the  completion  of  the  activation  of  its  subtasks. 

However,  in  most  cases  this  nesting  of  tasks  will  create  more  tasks  than  there 
are  processors,  giving  the  appearance  of  increased  parallelism.  If  we  attempt  to 
execute  these  nested  tasks  in  parallel,  we  will  most  likely  reduce  program  per- 
formance. Therefore,  our  implementation  attempts  to  find  "useful  parallehsm" 
by  executing  nested  occurrences  of  minitasks  sequentially.^ 

This  approach  directly  corresponds  to  executing  the  outer  of  two  parallel 
loops  in  parallel,  while  the  inner  loop  is  executed  sequentially.  This  scheme 
reduces  the  number  of  fork  and  join  points  (from  A'  to  1 )  in  addition  to  increasing 
the  size  of  each  parallel  thread.  This  concept  of  reducing  parallelism  to  increase 
program  performance  has  been  called  chunking,  and  is  described  in  several  other 


*We  assumr  »chcduJcr  interference  was  a  minitask  begins  its  execution. 

*The  concept  of  "useful  parallelism"  was  first  mentioned  by  Cylron  in  [Cyt85]. 
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declare 

subtype  outer jange  is  integer  range  I..M; 

procedure  outerJoop_body  (outerJd  :  outer_range)  is 
subtype  inner. range  is  integer  range  1..N; 
procedure  innerJoop.body  (innerJd  :  inner.range)  is 
begin 

—  Instructions  nested  inside  of  both  "loops"  go  here. 
end  inner. loop. body; 

package  inner.minitask  is  new 

gen _minit ask  (inner-range, inner  JoopJjody); 

begin 

if  not  inner.minitask. success  then 

—  Handle  error. 
endif ; 

end  my. loop; 

package  outer_minitask  is  new 

gen.minitask  (outer j-ange.outerJoop-body); 
begin 

if  not  outer  jninitask.success  then 

—  Handle  error. 
endif  ; 
end; 

Figure  5.4:  Nested  Parallel  Loops  using  the  Gen.Minitask  idiom 
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contexts  [FH90,Cyt85,KW85]. 

An  alternative  scheme  would  be  to  postpone  the  decision  regarding  the  man- 
ner in  which  the  inner  loop  is  executed  until  run  time;  the  inner  loop  would  be 
executed  in  parallel  if  the  number  of  available  processors  is  sufficient  to  improve 
program  performance.  Otherwise,  the  inner  loop  would  be  executed  sequen- 
tially. This  scheme  is  similar  to  the  transformations  described  by  Byler,  et.  al. 
in  [BDH+87]. 

The  second  potential  disadvantage  of  this  scheme  is  the  extra  storage  that 
it  requires.  In  a  standard  implementation  of  Ada  tasks,  a  stack  is  allocated  for 
each  task.  In  our  approach  we  allocate  a  stack  for  each  processor,  regardless  of 
the  number  of  tasks.  While  this  approach  does  indeed  use  excess  storage  when 
the  number  of  tasks  in  a  program  are  less  than  the  number  of  processors,  our 
intended  domain  of  applications  will  almost  edways  have  at  least  as  many  tasks 
as  processors.  Under  this  more  likely  scenario,  our  approach  reduces  the  amount 
of  stack  storage  required. 

5.5      Reducing  Minitask  Overhead 

The  previous  section  described  a  mechanism  to  remove  the  synchronization  point 
that  exists  between  task  creation  and  task  activation.  In  this  section  we  con- 
sider the  overhead  that  is  associated  with  each  minitask,  and  describe  how  our 
implementation  reduces  it. 

Each  executing  or  blocked  task  in  an  Ada  implementation  has  a  task  control 
block  (tcb)  associated  witii  it.  For  a  regular  Ada  task  in  the  Ada/Ed  run  time 
system  (rts),  the  tcb  is  given  in  Table  5.3.   Many  of  the  fields  that  are  required 
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for  the  management  of  a  regular  Ada  task  are  not  necessarily  needed  for  the 
management  of  a  minitask;  these  fields  can  be  omitted  from  the  tcb  of  a  minitask. 
Table  5.4  gives  the  fields  that  are  required  to  manage  the  context  of  a  minitask. 

5.5.1      Parallelism  Within  a  Minitask 

As  mentioned  in  the  previous  section,  a  programmer  may  find  it  useful  to  nest 
instantiations  of  the  gen.minitask  generic  package.  Under  this  scenario,  the 
loop_body  procedure  would  contain  its  ovfn  instantiation,  thereby  creating  its 
own  minitask  family.  Matrix  multiphcation  as  described  in  Appendix  A  is  one 
example  of  this  type  of  appHcation. 

Another  common  example  of  nested  parallelism  is  when  a  parallel  library 
package  is  used  inside  of  a  program  that  already  contains  loop-level  parallelism. 
While  the  additionaJ  paralleHsm  may  not  necessarily  improve  performance,  it 
should  nevertheless  be  executed  correctly.  For  this  reason  our  implementation 
of  the  gen.minitask  idiom  supports  ftill  nesting  of  gen.minitask  instantiations. 

Although  the  intended  usage  of  the  gen.minitask  package  is  to  simiilate  a 
parallel  loop,  there  is  nothing  to  prevent  the  threads  of  the  minitask,  represented 
by  the  loop  Jjody  procedure,  from  creating  and  activating  other  tasks.  Since  these 
tasks  can  be  of  any  type,  they  can  contain  entry  points.  In  particular,  they  can 
allow  a  minitask  to  call  an  entry  of  one  of  its  child  tasks.  While  this  may  have  a 
detrimental  effect  on  the  execution  performance  of  a  minitask,  it  is  nonetheless 
vcdid.  Therefore,  the  tcb  of  a  minitask  must  be  expanded. 

Note  that  this  form  of  communication  is  only  one-way.  Since  we  have  assured 
that  minitasks  do  not  contain  any  entry  points,  no  task,  specifically  a  child  task, 
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NAME 

TYPE 

USAGE 

template_base 

integer 

task  template  base  address 

template.offset 

integer 

task  template  offset 

brother 

integer 

next  direct  dependent  of  master  construct 

block -ptr 

pointer 

currently  executing  block  frame 

exception 

integer 

current  exception  raised 

id 

integer 

index  of  task  in  RTS  item 

master.task 

integer 

task  nimiber  of  the  master  task 

master -block 

integer 

number  of  the  master  block 

num  entries 

integer 

number  of  entries 

parent 

integer 

tcb  index  of  parent  task 

rtsjtem 

pointer 

current  rts  item 

tcb.status 

integer 

task's  current  status,  set  before  blocking 

action 

integer 

action  executing  when  blocked 

entryJtem 

pointer 

pointer  to  item  in  entry  queue 

current  .entry 

pointer 

pointer  to  current  entry  queue 

event 

integer 

event  to  imblock  (set  by  others) 

first 

integer 

used  for  task  termination 

ioJtem 

pointer 

current  io  item  (made  when  delay) 

num  J  terns 

integer 

number  of  items 

numjioterm 

integer 

number  of  nonterminatable  direct  dependent 

num.deps 

integer 

number  of  nonterminated  direct  dependent 

num.events 

integer 

number  of  pending  events 

abnormal 

boolean 

flag  set  when  task  is  aborted 

priority 

integer 

task  priority 

rdv 

boolean 

flag  set  when  task  executes  an  open  accept 

save.prio 

integer 

save  area  for  priority 

tcb-serviced 

integer 

current  calling  partners  chain 

next 

integer 

next  task  to  activate  or 

previous  task  being  serviced  by  this  task 

Table  5.3:  The  Ada/Ed  Task  Control  Block 
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NAME 

TYPE 

USAGE 

id 

integer 

index  of  task  in  rts  item 

master.task 

integer 

task  number  of  the  master  task 

master_block 

integer 

number  of  the  master  block 

block.ptr 

pointer 

currently  executing  block  frame 

exception 

integer 

current  exception  raised 

parent 

integer 

tcb  index  of  parent  task 

template.base 

integer 

task  template  base  address 

template-offset 

integer 

task  template  offset 

brother 

integer 

next  direct  dependent  of  master  construct 

rtsJtem 

pointer 

current  rts  item 

tcb_status 

integer 

task's  current  status,  set  before  blocking 

Table  5.4:  Tcb  Fields  Needed  for  Minitask  Implementation 

can  call  an  entry  of  a  minitask.  Therefore,  we  need  only  keep  status  information 
in  the  tcb  that  pertains  to  entry  calls,  rather  than  entry  management.  Table  5.5 
specifies  the  additional  fields  that  need  to  be  included. 

Before  we  discuss  those  fields  that  can  be  eliminated  from  the  tcb  of  a  mini- 
task,  we  prove  the  following  theorem: 

Theorem  5.2  A  minitask  cannot  be  aborted  directly. 


Proof  A  task  is  allowed  to  abort  any  task  (ARM  9.10(10)).  However,  in 
order  to  abort  a  task  directly,  the  name  of  the  task  must  be  supplied  (ARM 
9.10(2)).  Therefore,  only  tasks  that  are  visible  may  be  aborted  directly.  To 
abort  a  minitask,  another  task  would  need  to  name  the  corresponding  task  ob- 
ject, loop-tasks(i),  for  some  i.  This  array  of  tasks  is  only  visible  inside  of  the 
block  statement  of  the  gen  jninitask  generic  package.  Since  no  abort  statement 
exists  anywhere  in  this  statement,  we  conclude  that  minitasks  cannot  be  aborted 
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FIELD  NAME 

TYPE        USAGE 

action 

integer 

action  executing  when  blocked 

entryJtem 

pointer 

pointer  to  item  in  entry  queue 

current-entry 

pointer 

pointer  to  current  entry  queue 

event 

integer 

event  to  unblock  (set  by  others) 

first 

integer 

used  for  task  termination 

ioJtem 

pointer 

current  io item  (made  when  delay) 

numjtems 

integer 

number  of  items 

nuin_noterm       integer 

number  of  nonterminatable  direct  dependents 

num.deps 

integer 

number  of  nonterminated  direct  dependents 

num_events 

integer 

number  of  pending  events 

next 

integer 

next  task  to  activate  or 

previous  task  being  serviced  by  this  task 

Table  5.5:  Tcb  Fields  Needed  for  Minitasks  that  Contain  Tasks 


directly. 


As  mentioned  in  Chapter  3,  a  minitask  can  be  aborted  indirectly  if  its  master 
task  is  aborted.  However,  in  this  case  there  is  no  task  that  retains  visibility  to 
the  minitask,  ehminating  for  any  abort-related  context. 

We  now  turn  to  those  fields  not  included  in  Tables  5.4  or  5.5.  With  each 
field  we  give  a  brief  description  of  why  it  is  not  needed  in  the  tcb  of  a  minitask. 

abnormal  This  boolean  is  set  to  true  when  a  task  reaches  an  abnormal  state;  it 
is  either  aborted  or  an  exception  is  raised  during  its  execution  that  is  not 
handled.  By  theorem  5.2,  a  minitask  cannot  be  aborted  directly.  When  a 
minitask  is  aborted  Since  zJl  exceptions  are  caught  by  the  handler  located 
in  each  minitask,  (see  Figure  5.5),  this  field  is  not  needed. 

priority   This  is  the  priority  of  a   minitask.     A   priority  is  associated   with   a 
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task  type.  Since  the  minitask  task  type  is  specified  without  a  priority,  a 
predetermined  priority  is  assigned  to  it  by  the  implementation,  ehminating 
the  need  for  this  field  in  the  tcb. 

rdv  This  boolean  flag  is  set  when  a  task  executes  an  open  accept  statement. 
It  signifies  that  a  calhng  task  may  participate  in  a  rendezvous  immedi- 
ately. Since  minitasks  contain  no  entry  points,  no  accept  statements  can 
be  executed,  rendering  this  flag  unnecessary. 

num_entries  This  field  contains  the  total  number  of  entries  associated  with 
a  task.  Since  a  minitask  does  not  contain  any  entries,  this  field  can  be 
ehminated. 

save.prio  This  field  is  used  to  save  the  priority  of  a  calfing  task  when  it  per- 
forms a  rendezvous  with  this  task.  Since  minitasks  do  not  have  any  entries, 
it  need  not  allocate  storage  for  any  potential  callers'  priorities. 

tcb_serviced  This  field  corresponds  to  the  current  calling  task  of  an  entry. 
Once  again,  since  minitasks  do  not  contain  ciny  entries,  this  field  is  not 
required. 

5.5.2      Further  Reducing  Beacon  Overhead 

We  have  just  seen  how  to  reduce  the  amount  of  storage  associated  with  a  mini- 
task.  Now  we  turn  out  attention  to  the  operations  a  minitask  must  perform  before 
it  executes  the  "sequence  of  statements"  given  by  the  loop_body  procedure. 

Consider  the  body  of  the  minitask  as  shown  in  Figure  5.5.   The  first  action 
an  activated  minitask  performs  is  the  caJl  to  the  faa  function.    This  allows  a 
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task  body  minitask  is 

iterate  :  iteration_range; 

procedure  record.exception(exception. raised  :  exception.type)  is 
begin 

task_status(iterate). exception. kind  :=  exception.raised; 

task_status(iterate). completed  :=  false; 

success  :=  false; 
end  record.exception; 

begin 

iterate  :=  iteration_range(iteration.counter.faa(l)); 
loop  J3ody(  iterate); 

exception 

when  CONSTRAINT_ERROR  =>  recordjexception(Constraint); 
when  NUMERICJERROR  =>  record.exception(Numeric); 
when  PROGRAM_ERROR  =>  record-exception(Program); 
when  STORAGE_ERROR  =>  record_exception(Storage); 
when  TASKING_ERR0R  =>  record.exception(Tasking); 
when  others  =>  record-exception(Other); 
end  minitask; 

Figure  5.5:  The  Body  of  Minitask  Revisited 

minitask  to  obtain  a  unique  identity  in  the  appropriate  iteration  range.  The 
value  returned  is  converted  into  an  integer  type  and  assigned  to  the  local  variable, 
iterate. 

As  specified  in  Chapter  3,  the  function  call  to  faa  can  be  translated  into  a 
fetch.and.add  operation.  However,  in  the  context  of  minitasks  a  niore  efficient 
implementation  can  be  realized. 

In  our  implementation,  tasks  are  created  and  activated  in  parallel;  the  task 
performing  the  elaboration  of  loop.tasks  enqueues  an  item  on  to  a  parallel  queue. 
This  item  has  a  multiplicity  corresponding  to  the  number  of  tasks  to  be  created. 
As  available  processes  perform  a  dequeue  operation,  the  multiplicity  of  this  item 
is  decremented.  The  processor  that  performs  the  decrement  receives  aji  identity 
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that  uniquely  distinguishes  the  task  to  be  created. 

At  the  heart  of  the  parallel  queue  aJgorithm  is  the  fetch. and_add  primitive.^ 
Although  the  value  that  is  returned  by  this  fetch.and.add  operation  is  not  avail- 
able to  the  Ada  programmer,  it  nevertheless  does  exist.  The  first  operation  of 
a  minitask  is  to  acquire  a  unique  identity  via  the  fetch_and_add  construct.  Since 
this  identity  has  already  been  obtained  by  the  dequeue  operation,  there  is  no 
need  to  execute  a  fetch.and.add  operation  at  the  time  a  minitask  is  created. 

As  suggested  by  the  code  in  Figure  5.5,  after  executing  the  fetch.and.add 
operation,  a  constrciint  check  is  made  to  ensure  that  the  vaJue  returned  by  faa 
is  in  the  bounds  of  iteration.range.  However,  the  following  theorem  allows  this 
check  to  be  eliminated. 

Theorem  5.3  The  value  returned  by  the  faa  function  in  the  body  o/ minitask  is 
always  within  the  range  iteration.range'first  .  . .  iteration_range'last. 

Proof  Upon  instantiation  of  the  gen  J>eacon  package,  the  initial  value  sup- 
plied for  the  beacon  variable  is  iteration.range'first.  The  first  call  to  faa  returns 
this  initial  value,  while  incrementing  the  beacon  variable  by  one.  By  inspecting 
the  body  of  the  minitask  we  see  that  exactly  one  caJl  to  faa  is  made  for  each  mini- 
task.  Since  iteration.range'last  -  iteration.range'first  -|-  1  minitasks  exist,  precisely 
this  number  of  calls  to  faa  are  made.  Since  each  call  provides  a  unit  increment, 
the  vaJues  iteration.range'first,  iterationj'ange'first  -f  1,  . ..  iterationj-ange'last  are 
returned  to  each  of  these  calls.  Therefore,  the  vaJues  returned  by  the  faa  satisfy 
the  constrciint  specified  by  the  iteration.range  type.  ^ 


'For  more  information  about  this  algorithm  sec  [Hum88]. 
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Theorem  5.3  allows  us  to  remove  the  constraint  check  that  is  associated  with 
the  faa  call.  Since  the  identity  values  returned  by  the  dequeue  operation  of  the 
Ada/Ed  RTS  is  a  value  starting  at  1  [Hum88],  our  implementation  simply  adds 
iterationjange'first  to  this  identity  to  obtciin  the  proper  iterate.  This  addition 
replaces  the  faa  function  call. 

5.5.3      Ada  Shared  Variables 

In  Section  3.3.1  of  Chapter  3  we  discussed  how  Ada  shared  variables  and  the 
gen.beacon  package  interact.  In  this  section  we  explore  how  the  potential  use  of 
these  variables  affect  the  implementation  of  the  gen.minitask  package. 

Recall  from  Chapter  2  that  an  Ada  shared  variable  is  a  scalar  or  access  vari- 
able that  is  not  declared  as  pragma  shared,  but  is  nonetheless  used  in  a  shared 
fashion,  i.e.  by  several  tasks  without  explicit  synchronization.  The  manner  in 
which  these  types  of  shared  variables  can  be  used  is  described  in  ARM    9.11(3-6): 

"For  the  actions  performed  by  a  program  that  uses  shared  variables, 
the  following  assumptions  can  always  be  made: 

•  If  between  two  synchronization  points  of  a  task,  this  task  reads 
a  shared  variable  whose  type  is  a  scalar  or  access  type,  then  the 
variable  is  not  updated  by  any  other  task  at  any  time  between 
these  two  points. 

•  If  between  two  synchronization  points  of  a  task,  this  task  up- 
dates a  shared  variable  whose  type  is  a  scalar  of  access  type, 
then  the  variable  is  neither  read  nor  updated  by  any  other  task 
at  any  time  between  these  two  points. 
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The  execution  of  the  program  is  erroneous  if  any  of  these  assumptions 
is  violated." 
A  synchronization  point  for  a  shared  variable  is  defined  in  the  following: 

"Two  tasks  are  synchronized  at  the  start  and  end  of  their  rendezvous. 
At  the  start  and  at  the  end  of  its  activation,  a  task  is  synchronized 
with  the  task  that  causes  this  activation.  A  task  that  has  completed 
its  execution  is  synchronized  vsdth  any  other  task."  ARM  9.11(2) 

As  mentioned  in  ARM  9.11(7)  an  implementation  is  allowed  to  keep  local 
(cache)  copies  of  an  Ada  shared  variable  associated  with  a  task,  flushing  them 
to  global  memory  at  the  task's  synchronization  points.  Although  the  gen.- 
minitask  and  gen_beacon  generic  packages  do  not  contain  any  Ada  shared  vari- 
ables, we  cannot  guarantee  that  a  program  that  uses  these  generics  does  not 
contain  these  variables.  Therefore,  in  our  implementation  of  an  instantiation  of 
the  gen.minitask  generic,  we  must  ensure  that  proper  semantics  are  upheld  in 
respect  to  these  Ada  shared  variables. 

Note  that  although  minitasks  are  not  visible  to  the  users  of  the  gen.minitask 
generic,  the  reference  manual  nevertheless  requires  that  the  above  mentioned 
synchronization  points  be  honored  with  respect  to  Ada  shared  variables. 

In  considering  how  these  synchronization  points  affect  the  minitask  imple- 
mentation, we  consider  each  potential  point  in  turn. 

At  the  start  of  its  activation:  The  beginning  of  a  minitask  activation  is  the 
first  synchronization  point  a  minitask  reaches.  Since  no  memory  accesses 
have  been  made  by  the  minitask,  no  loccd  copies  exist;  no  cache  flushes 
need  to  be  performed  for  the  activating  minitask. 
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The  start  of  an  activation  is  also  a  synchronization  point  with  the  task  that 
causes  this  activation,  the  task  that  instantiates  the  gen.minitask  generic. 
Since  this  task  can  possess  local  copies  of  Ada  shared  variables,  they  must 
be  flushed  at  the  start  of  minitask  activation. 

At  the  end  of  its  activation:  Since  no  Ada  shared  variable  is  accessed  during 
the  activation  of  a  minitask,  no  cache  flush  is  required.  Similarly,  as  the 
instantiating  task  does  not  access  any  variables  since  the  start  of  this 
activation,  no  cache  flush  is  required  for  this  task. 

At  the  start  of  a  rendezvous:  Since  minitasks  do  not  contain  entry  points, 
they  can  only  execute  rendezvous  when  they  call  another  task.  As  we 
saw  in  Chapter  3,  the  call  to  the  faa  function  results  in  a  rendezvous. 
In  the  general  case,  this  rendezvous  would  require  a  cache  flush  of  any 
of  the  caller's  Ada  shared  variables.  However,  this  statement  is  the  first 
statement  executed  by  a  minitask;  no  local  copies  can  exist.  Thus,  no  cache 
flush  is  required  for  this  synchronization  point. 

If  the  procedure  parameter  to  gen_minitask  contains  tasks,  then  other  ren- 
dezvous are  possible.  Under  this  scenario,  the  procedure  would  declare  a 
task  and  then  call  one  of  its  entries.  This  call  constitutes  a  synchroniza- 
tion point  for  any  Ada  shared  variables  this  procedure  may  have  accessed. 
Therefore,  at  this  time  any  local  copies  of  these  variables  must  be  flushed. 

At  the  end  of  a  rendezvous:  Between  the  beginning  and  end  of  the  ren- 
dezvous that  is  executed  by  the  call  to  the  faa  function,  only  two  operations 
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occur:  the  current  value  of  the  beacon  shared  variable  is  saved  to  be  re- 
turned, and  the  beacon  shared  variable  is  incremented  by  the  appropriate 
value.  As  the  beacon  shared  variable  is  actually  a  local  variable  to  the 
beacon  task,  no  access  to  an  Ada  shared  variable  is  performed  during  the 
rendezvous.  For  this  reason,  no  cache  flush  is  required  at  this  time. 

If  other  minitask  rendezvous  exist,  a  cache  flush  is  required  if  any  Ada 
shared  variables  have  been  updated  during  the  rendezvous. 

At  the  completion  of  a  task's  execution:  When  a  minitask  completes  its 
execution  it  can  potentially  have  reference  Ada  shared  variables.  These 
variable  coiald  have  been  cached  locally  and  used  by  the  instantiating  task. 
Therefore,  when  a  minitask  completes  its  execution,  any  Ada  shared  vari- 
ables that  have  been  store  locally  are  flushed  into  global  memory. 

These  results  are  summarized  by  Table  5.6. 

In  Section  5.5.2,  we  mentioned  that  the  function  call  to  faa  can  be  optimized 
away.  However,  we  must  note  that  this  function  call  led  to  a  rendezvous,  and 
thus,  a  synchronization  point  for  Ada  shared  variables.  Although  this  operation 
is  no  longer  needed,  we  must  still  ensure  that  the  proper  semantics  are  upheld 
with  respect  to  any  Ada  shared  variables.  Specifically,  if  any  local  copies  of 
an  Ada  shared  variable  exist,  they  must  be  flushed  at  the  point  where  the 
rendezvous  would  have  occurred.  As  described  above,  no  local  copies  of  Ada 
shared  variables  Ccin  exist,  allowing  for  the  cache  flush,  too,  to  be  optimized 
away. 

The  following  theorem  aillows  for  another  optimization  to  be  performed. 
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Synchronization  Point 

Action! 

Start  of  activation 

Flush  activator's  cache 

End  of  activation 

None 

Start  of  faa  rendezvous 

None 

Start  of  any  other  rendezvous 

Flush  both  tasks'  caches 

End  of  faa  rendezvous 

None 

End  of  any  other  rendezvous 

Flush  both  tasks'  caches 

Completion  of  task's  execution 

Flush  terminating  task's  cache 

f'Flush  Cache"  refers  to  flushing  all  Ada  shared  variables  that  have  been  written  to 
since  the  last  cache  flush. 

Table  5.6:  Cache  Actions  Required  for  Minitasks 

Theorem  5.4  The  instantiation  o/ iteration  .counter  requires  no  run  time  action 
to  be  performed. 

Proof  Consider  iteration-counter,  an  instantiation  of  the  gen.beacon  generic 
package  as  shown  in  Figure  5.6.  This  instantiation  provides  three  subprograms: 
read,  write,  and  faa.  In  the  body  of  each  minitask  object,  only  the  faa  function 
is  called.  From  the  discussion  in  the  previous  paragraph  and  Section  5.5.2, 
we  have  concluded  that  this  call  need  not  be  performed.  As  this  is  the  only 
call  that  accesses  the  underlying  beacon  variable,  no  allocation  of  the  beacon 
variable,  and  therefore,  the  gen.beacon_struct  need  occur.  Hence,  instantiating 
iteration.counter  does  not  require  any  run  time  action.  % 


5.5.4      Handling  Exceptions 

This  section  describes  the  method  employed  to  handle  exceptions  that  may  arise 
during  the  execution  of  an  instantiation  of  gen  minitask.  Two  types  of  exceptions 
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package  body  gen.minitask  is 

Create  a  fetch_and.add  variable  for  minitask  initialization. 

package  iteration-counter  is  new 

gen_beacon(integer(iteration_range'first)); 

begin 

declare 

loop.tasks  :  array(iteration  j-ange)  of  minitask; 
begin 

null  ; 
end; 
end  gen.minitask; 

Figure  5.6:  The  Instantiation  of  Gen. Beacon  Revisited 
can  be  raised  during  the  instantiation  itself: 

•  an  exception  raised  during  the  elaboration  of  the  package  declaration,  and 

•  an  exception  raised  during  the  execution  of  the  package  body. 

If  an  exception  is  raised  during  the  elaboration  of  the  generic  package  decla- 
ration, this  exception  is  propagated  to  the  declarative  part  that  instantiated  the 
generic  (ARM  11.4.2(7)).  If  an  exception  is  raised  while  elaborating  the  declara- 
tive part  of  gen.minitask,  our  implementation  simply  propagates  this  exception, 
and  abandons  any  further  action  concerning  this  instantiation. 

When  an  exception  is  rziised  during  the  execution  of  the  package  body  of  the 
minitask  generic,  different  semantics  are  in  effect.  As  mentioned  in  section  4.5.2, 
the  manner  in  which  this  type  of  exception  is  handled  depends  on  where  in  the 
minitask  the  exception  is  raised  and  whether  an  exception  hamdler  exists  for  the 
generic  parameter,  loopJjody. 


CHAPTERS.    GEN  MINITASK  IMPLEMENTATION  125 

Consider  the  case  where  an  exception  is  raised  and  handled  inside  of  loop- 
Jjody.  Under  this  scenario,  the  standard  implementation  ensures  that  control 
passes  to  the  appropriate  exception  handler;  no  specicil  implementation  is  re- 
quired. 

Now  consider  the  case  when  an  exception  is  raised  inside  of  the  minitask, 
but  is  not  handled  by  the  loop_body  procedure.  A  "normal"  implementation  of 
a  minitask  would  install  a  handler  upon  entering  the  body  of  this  task.  This 
handler  would  contain  pointers  to  instructions  to  be  executed  depending  on 
what  exception  is  raised.  Since  the  minitask  handler  is  known,  we  can  create 
a  predefined  handler  that  satisfies  the  semantics  of  the  handler  specified  in  the 
body  of  the  minitask.  Using  this  approach  we  are  able  to  avoid  the  overhead 
involved  with  calhng  the  procedure,  record.exception;  we  execute  its  instructions 
inline. 

Figures  5.8  and  5.9  summarize  the  implementation  described  in  this  chapter. 
Figure  5.8  specifies  the  actions  that  the  instantiating  task  executes,  while  Fig- 
ure 5.9  describes  the  manner  in  which  each  minitask  is  created,  activated,  and 
commences  execution.  In  the  latter  aJgorithm  we  assume  that  a  process  which 
is  not  currently  blocked  on  a  minitask  has  dequeued  a  minitask  item  from  the 
queue.  This  item  with  multiphcity  N  is  enqueued  by  the  former  algorithm. 

Upon  returning  from  the  loop_body  procedure  the  minitask  algorithm  at- 
tempts to  begin  the  execution  of  another  minitask.  If  no  more  minitasks  exist, 
this  routine  terminates. 

If  an  exception  is  raised  or  propagated  to  the  body  of  the  routine  in  Fig- 
ure 5.9,  it  is  handled  as  previously  described;  a  default  minitask  exception  han- 
dler is  present  to  handle  these  exceptions  appropriately. 
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procedure  handle.exception 
INPUTS:  exception,  iterate 
MODIFIED:  success,  task.comp,  task.exc 
of  current  instantiation, 
begin 

case  exception  is 

when  CONSTRAINT-ERROR   => 
task.exc(iterate)  :=  Constraint; 
task-comp(iterate)  :=  false; 
success  :=  false; 
when  Nl"MERIC_ERROR  => 
task.exc(iterate)  :=  Numeric; 
task.comp(iterate)  :=  false; 
success  :=  false; 
when  PROGRAM_ERROR   => 
task.exc( iterate)  :=  Program; 
task.comp(iterate)  :=  false; 
success  :=  false; 
when  STORAGE-ERROR   => 
task.exc(iterate)  :=  Storage; 
task.comp(iterate)  :=  false; 
success  :=  false; 
when  taskingj:rror  => 
task-exc(iterate)  :=  Tasking; 
task_comp(iterate)  :=  false; 
success  :=  false; 
when  others  => 

task.exc(iterate)  :=  Others; 
task_comp(iterate)  :=  false; 
success  :=  false; 
end  handle-exception; 

Figure  5.7:  Minitask  Exception  Handling  Algorithm 
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procedure  create_minitaskJamily 

INPUTS:  iteration_range,  loop-body 

OVTPUTS:  success 
begin 

success  :=  alloc  (1); 

if  alloc  is  unsuccessful  then 

raise  STORAGE-ERROR  in  instantiating  declarative  part. 

end  if; 

success  :=  true; 

Let  A'  =  iteration.range'last  -  iteration.range'first  +  1. 

task.comp  :=  alloc  (.V); 

task.exc  ;=  alloc (.V); 

if  either  alloc  is  unsuccessful  then 

raise  ST0RAGE_ERR0R  in  instantiating  declarative  part. 

end  if; 

Mark  exception.type,  task.completion  and  task.exception 
visible  to  the  instantiator. 

if  processor  is  already  executing  a  minitask  then 

Execute  nninitasks  sequentially. 
else  —  Execute  minitasks  in  parallel. 

my_tcb.num.items  :=  A'; 

node  :=  alloc  (miM_RTS.NODE_SIZE); 

if  alloc  is  unsuccessful  then 
raise  storage_ERROR. 

end  if; 

node  mult  :=  A'; 

node.proc  ;=  loopJ)ody; 

node  range    =  iteration j'ange; 

node  parent  :=  myjd, 

node. type  ;=  CREATE.MIM; 

enqueue (node); 

block  until  a  nninitask  child  unblocks  us. 
end  if, 
end  create.minitask.family; 

Figure  5.S:  Instantiating  Teisk's  Algorithm 
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procedure  create_and_execute.minitask 

INPUTS:  node 
begin 

Initialize  local  stack  and  tcb. 

Mark  that  this  processor  is  executing  a  minitask. 

loop 

iterate  :=  node. id  +  node.iteration.range'first; 
task. comp  (iterate)  :=  true; 
task_exc (iterate)  :=  None; 
loop_body  (iterate); 

Check  if  we  are  the  last  minitask  to  terminate. 

if  fetch.and.add  (node. parent. num.items.-l)  =  1  then 
unblock  parent; 

Mark  that  processor  is  NOT  executing  a  minitask. 
exit; 
else  —  Try  to  get  another  minitask. 

node  :=  dequeue  next  minitask  in  this  family; 
if  node  =  NIL  then 

Mark  that  processor  is  NOT  executing  a  minitask. 
exit; 
end  if; 
end  if; 
end  loop; 
end  create. and.execute.minitask; 

Figure  5.9:  Algorithm  to  Create,  and  Execute  a  Minitask 
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Function  Call 

Value  Returned 

my_rnini.task.completion(i); 

inyjTiiiii.task.coinp(i) 

niy_mini.task_exception(i); 

my_mini.task-exc(i) 

Note:  iteration.range  refers  to  the  type  paraiineter  that  was  specified  for  the  instantiation 
of  my.mini. 

Table  5.7:    Implementation  of  the  task.completion  and  task.exception  Subpro- 
grams 

The  two  functions,  task_completion  and  task.exception  are  implemented  in  a 
straightforward  manner.  Each  call  is  translated  to  the  appropriate  dereference 
operation  into  the  two  run  time  storage  arrays:  task.comp  and  task.exc.  return- 
ing the  appropriate  value  to  the  caller.  Table  5.7  specifies  the  implementation 
details  for  these  two  functions.  It  assumes  that  the  gen_minitask  instantiation  is 
called  myjnini. 


Chapter  6 


Performance  Analysis 


In  this  chapter  we  analyze  the  performance  of  three  implementations  of  a  par- 
allel loop  written  in  Ada.  The  first  two  methods  represent  standard  ways  of 
expressing  such  a  parallel  loop.  We  describe  the  amount  of  overhead  that  is 
required  to  execute  both  of  these  methods.  These  figures  are  compared  to  the 
execution  overhead  of  our  implementation  using  the  gen.minitask  idiom.  We 
show  how  the  latter  overhead  compares  favorably  with  the  other  two.  We  then 
examine  issues  of  granularity  and  efficiency  for  all  three  forms  of  parallel  loops, 
supporting  the  conclusion  that  our  idiom  provides  a  more  efficient  version  of 
loop-level  paralleHsm. 

We  use  the  standard  means  of  expressing  a  parallel  loop  in  Ada  as  shown 
in  Figure  6.1.  Throughout  the  rest  of  this  chapter,  we  refer  to  this  code  as  an 
Ada  "parallel  loop  equivalent"  or  PLE.  Figure  6.2  contains  the  corresponding 
example  of  the  gen.minitask  idiom  or  GMI. 

Although  both  versions  of  the  paraUel  loop  offer  similar  semantics,  the  GMI 
provides  an  increase  in  functionality.   In  the  Ada  PLE  there  is  no  mechanism 
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declare 

task  type  iteration.task  is 

entry  get  Jterate  (iterate  :  in  integer); 
end  iteration.task; 

parJoop  :  array  (1..N)  of  iteration.task; 

task  body  iterationJask  is 

my  Jterate  :   integer; 
begin 

accept  get  Jterate  (iterate  :  in  integer)  do 

my.iterate  :=  iterate; Acquire  iterate  from  parent. 

end  get.iterate; 

Si 

end  iteration.task; 

begin  —  AH  "iterations"  begin  to  execute. 

for  i  in  1..N  loop  —  Iterations  are  distributed  sequentially, 
par  Joop(i).getJterate  (i); 

end  loop; 
end;  —  Waiit  for  all  iteration-tasks  to  terminate. 

Figure  6.1:  An  Ada  Parallel  Loop  Equivcdent  [PLE) 
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declare 

subtype  myJoop_range  is  integer  range  1..N; 
procedure  myJoop.body  (nnyjterate  :   myJoop.range)  is 
begin 

Si: 

5„: 
end  my.loop; 

—  This  instantiation  corresponds  to  the  execution 

of  all  N  iterations. 

package  my-minitask  is  new 

gen.minltask  (myJoop_range,myJoop_body); 
begin 

if  not  my  jninitask. success  then 

—  Handle  error  condition. 
end  if  ; 
end; 

Figure  6.2:  A  Parallel  Loop  Using  the  Gen.Minitask  Idiom  {GMI) 

for  the  parent  thread  to  inspect  the  status  of  the  tasks  that  play  the  role  of 
iterates;  the  conditional  in  Figure  6.2  cannot  be  written  in  Figure  6.1.  For 
this  reason,  we  do  not  include  the  instructions  associated  with  executing  this 
conditioned  in  computing  its  overhead.  However,  we  do  account  for  the  allocation 
and  recording  of  status  information  that  is  used  when  executing  this  conditional. 
The  GMI  cdso  provides  two  functions  that  eillow  the  parent  thread  to  inspect  the 
completion  status  of  each  iteration.  It  thereby  provides  a  more  robust  version 
of  a  parcJlel  loop,  in  addition  to  a  reduction  in  overhead. 

The  rest  of  this  chapter  is  organized  in  the  following  manner.  First  we  present 
the  assumptions  that  underly  our  analysis.  Next  we  analyze  the  overhead  asso- 
ciated with  the  Ada  PLE  and  a  tree-based  variant.  This  is  followed  by  a  similar 
analysis  of  the  gen_minitask  idiom.  We  conclude  the  chapter  by  comparing  both 
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methods  of  obtaining  parallel  loops  in  Ada  and  suggest  further  enhancements. 

6.1      Preliminaries 

As  noted  in  Chapter  2,  our  implementation  is  based  on  the  Ada/Ed  system,  a 
descendant  of  the  first  validated  Ada  compiler  [DFS''"80].  In  addition  to  pro- 
viding the  basis  for  the  implementation  of  the  gen_minitask  idiom,  we  use  the 
Ada/Ed  implementation  to  determine  the  overhead  associated  vfith  executing 
the  Ada  PLE.  Although  this  overhead  figure  is  Ukely  to  differ  from  the  corre- 
sponding figures  of  other  implementations,  we  feel  that  as  our  implementation 
utilizes  the  same  compiler  model  as  the  Ada/Ed  system,  it  is  a  fair  comparison. 
Moreover,  sources  of  overhead  in  the  actions  performed  by  the  Ada/Ed  system 
are  mandated  by  the  semantics  of  Ada,  and  therefore,  must  be  performed  by 
any  implementation. 

Since  we  do  not  have  access  to  an  Ada  implementation  on  a  parallel  machine, 
our  overhead  analysis  is  static.  As  both  the  Ada/Ed  system  and  our  implemen- 
tation are  interpreters  written  in  C,  the  unit  of  measure  we  use  in  our  aneilysis  is 
the  number  of  C  instructions  executed  by  each  implementation.  Although  this 
approach  may  not  be  as  accurate  as  counting  machine  instructions,  we  feel  that 
the  results  presented  in  this  chapter  are  a  good  approximation  of  the  magnitude 
of  the  overhead  of  both,  our  implementation,  and  a  standard  implementation  of 
the  Ada  PLE. 
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6.1.1      Assumptions 

In  order  to  provide  a  reasonable  measurement  of  overhead  we  make  some  sim- 
plifying assumptions: 

1.  All  tasks  execute  at  the  same  rate.  If  N  tasks  execute  the  same  ten 
instructions  starting  at  the  same  time,  we  assume  all  complete  in  the  time 
it  takes  one  task  to  execute  ten  instructions.  To  compare  with  a  sequential 
execution  we  say  that  ten  instructions  are  executed. 

2.  There  are  sufficient  processors  available  to  execute  all  parallel  actions.  If 
A'^  tasks  are  eligible  to  execute,  we  assume  at  least  N  processors  exist  to 
execute  them.  It  is  a  simple  matter  to  scale  our  results  when  the  number 
of  tasks  is  greater  than  the  number  of  processors. 

3.  No  error  conditions  arise.  Although  it  is  certainly  possible  for  a  run  time 
error  (e.g.  storage.error)  to  occur,  we  cire  most  interested  in  the  cases  where 
no  errors  occur.  Thus,  when  code  is  encountered  that  is  conditionally 
dependent  on  whether  aji  error  condition  has  occurred,  we  assume  that 
the  non-error  branch  is  taken. 

6.2      The  Ada  Parallel  Loop  Equivalent  (PLE) 

In  this  section  we  describe  the  amount  of  overhead  that  is  required  to  execute  the 
Ada  PLE  in  the  Ada/Ed  system.  We  divide  the  instructions  that  are  executed 
by  the  Ada  PLE  into  two  categories: 
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Task  Initiation:  This  category  includes  the  instructions  reqviired  to  initiate 
the  tasks  of  the  array. 

Task  Initialization:  Once  the  tasks  have  been  initiated,  their  iterates  must  be 
distributed.  This  category  includes  the  instructions  required  to  distribute 
unique  identities  to  each  task,  i.e.  to  solve  the  task  initialization  problem. 

In  the  next  two  sections  we  discuss  the  overhead  associated  with  each  of 
these  categories  in  executing  the  Ada  PLE. 

6.2.1      Task  Initiation 

Although  task  creation  and  task  activation  comprise  a  major  portion  of  task 
initiation,  other  actions  must  be  performed.  These  actions  include: 

Pre-Task  Creation:  This  category  includes  actions  that  are  completed  before 
the  tasks  are  created.  These  actions  create  the  Ada  objects  that  are  needed 
to  construct  a  parallel  loop.  They  include: 

•  the  elaboration  of  the  task  type,  iteration.task,  and 

•  the  elaboration  of  the  array,  parJoop,  up  to,  but  not  including,  the 
creation  of  the  tasks. 

Task  Creation:  Once  the  task  type  and  array  have  been  created,  the  creation 
of  each  task  takes  place.  This  involves  allocating  and  initializing  the  task's 
stack  and  tcb.  As  the  reference  manual  specifies  that  the  order  in  which 
these  creations  occur  is  "not  defined  by  the  language"  (ARM  9.2(2),  3.2.1), 
the  Ada/Ed  system  performs  them  in  parallel. 
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Post  Task  Creation:  After  all  tasks  are  created  the  rest  of  the  declarative  part 
is  elaborated.  This  includes  elaborating  the  task  body  of  iteration.task. 

Task  Activation:  At  the  conclusion  of  the  elaboration  of  the  declarative  part 
all  tasks  are  activated.  This  activation  occurs  in  two  steps: 

•  The  parent  task  notifies  all  tasks  that  they  may  begin  their  execution; 
it  places  them  on  the  ready  queue. 

•  Each  new  task  elaborates  its  declarative  part. 

Post  Execution:  Once  all  tasks  have  terminated,  their  stacks  are  reclaimed 
and  the  stack  frame  associated  with  the  block  statement  in  the  parent 
task  is  popped. 

We  specify  the  number  of  instructions  needed  to  perform  each  of  these  actions 
in  Table  6.1.  As  this  table  illustrates,  pre-task  creation  accounts  for  aJmost  50% 
of  the  instructions  that  comprise  task  initiation.  This  is  due  to  the  elaboration 
of  the  entities  (task  type  and  array)  that  cire  needed  to  specify  a  parallel  loop 
in  Ada.  As  expected,  most  of  the  remaining  overhead  is  due  to  task  creation 
(22.1%)  and  task  activation  (23.3%).  Although  these  actions  are  also  present 
in  the  GMI,  we  shall  see  that  most  actions  performed  in  the  pre-task  creation 
category  are  not  required,  therefore  allowing  for  the  ehmination  of  a  significant 
portion  of  task  initiation  overhead. 

6.2.2      Task  Initialization 

In  this  section  we  examine  the  overhead  that  results  when  task  initialization  is 
performed  sequentially.    We  show  that  this  process  accounts  for  a  substantial 
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Action 
Performed 

Number  of 
Instructions 

Percentage 
OF  Total 

Pre-Task  Creation 

465 

47.7% 

Task  Creation 

216 

22.1% 

Post  Task  Creation 

33 

3.4% 

Task  Activation 

227 

23.3% 

Post  Execution 

34 

3.5% 

Total 

975 

100.0% 

Table  6.1:  Task  Initiation  Overhead  for  the  Ada  PLE 


amount  of  the  overhead  associated  with  the  Ada  PLE. 

In  order  to  distribute  a  unique  identity,  or  iterate,  to  each  task,  a  task  must 
call  each  child  task's  entry,  specifying  the  iterate  as  a  paxameter.  In  the  case 
of  the  Ada  PLE,  these  calls  are  performed  by  the  parent  task.  As  it  is  not 
possible  for  these  entry  calls  to  be  performed  parallel,  iterates  are  distributed 
sequentially  (see  Figure  6.1  on  page  130).  Since  each  child  task  cannot  begin  its 
execution  until  it  receives  this  iterate,  the  first  action  it  performs  is  to  execute 
an  accept  statement  on  that  entry. 

At  the  conclusion  of  the  rendezvous  the  child  task  begins  executing  the  itera- 
tion of  the  parallel  loop  it  has  been  assigned.  At  the  same  time,  the  parent  task 
continues  executing  its  for  loop  and  calls  the  entry  of  the  next  task.  This  process 
continues  until  all  iterates  have  been  distributed.  As  this  is  a  sericil  operation 
for  the  parent,  the  amount  of  delay  a  child  tcisk  must  endure  is  dependent  on 
when  its  get. iterate  entry  is  called  by  the  parent. 

Table  6.2  specifies  the  overhead  involved  in  distributing  iterates  to  each  child 
task.  Since  this  involves  the  coordination  of  two  tasks,  some  assumptions  about 
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Action(s)  Performed 

Parent  Task 

Child  Task 

Pre-Entry  Call/Pre-Accept  Stmt. 

165 

16 

Entry  Call/Accept  Stmt. 

59 

198 

Rendezvous 

97 

Post  Entry  Call 

130 

t 

Loop  Overhead 

140 

Entry  Call/Accept  Stmt. 

59 

198 

Rendezvous 

97 

Post  Entry  Call 

130 

t 

': 

; 

': 

I  At  this  point  the  child  task  begins  its  execution. 


Table  6.2:  Task  Initialization  Overhead  for  the  Ada  PLE 


the  execution  rate  of  each  task  must  be  made.  As  mentioned  in  Section  6.1.1, 
our  analysis  assumes  that  enough  processors  are  available  to  execute  each  task, 
and  that  each  task  executes  at  the  same  rate.  Therefore,  while  the  parent  task 
executes  the  instructions  labeled  pre-entry  call  and  entry  call,  each  of  the  A^ 
children  tasks  execute  the  instructions  labeled  pre-accept  statement  and  accept 
statement.  Since  the  number  of  instructions  required  to  execute  the  pre-entry 
CcJI  category  is  greater  than  the  number  of  instructions  reqiured  to  execute 
the  pre-accept  statement  category  (165  vs.  16),  we  assume  that  each  child  task 
reaches  its  accept  statement  before  the  parent  calls  any  entry. 

While  a  child  task  waits  for  the  parent  task  to  perform  its  entry  call,  it 
is  blocked.  In  the  Ada/Ed  system  a  rendezvous  is  executed  by  the  processor 
of  the  task  that  executes  the  accept  statement.  Thus,  when  the  paxent  task 
executes  this  entry  call,  it  awakens  the  child  task.  This  blocking  of  the  child 
task,  and  its  subsequent  awakening  by  the  parent  account  for  the  majority  of 


CHAPTER  6.    PERFORMANCE  ANALYSIS  139 

the  198  instructions  that  are  attributed  to  the  child  task  in  the  accept  statement 
category  of  Table  6.2. 

Once  the  parent  and  first  child  task  have  executed  their  pre-rendezvous  in- 
structions, a  rendezvous  occurs.  While  the  body  of  the  accept  statement  is 
performed  by  the  child's  processor,  the  parent  task  is  blocked.  The  body  of  the 
accept  statement  requires  97  instructions  to  execute.  Upon  its  completion  the 
child  task  begins  to  execute  the  iteration  of  the  parallel  loop  that  it  has  been 
assigned.  At  this  point  this  iteration  has  begun. 

The  parent  task,  however,  has  only  begun  with  its  job.  After  the  rendezvous 
has  completed,  the  processor  executing  the  parent  task  executes  instructions 
labeled  post  entry  call  to  unblock  the  parent  ta^k.  Once  the  parent  task  has 
been  unblocked,  it  attempts  to  execute  the  next  iteration  of  the  for  loop,  which 
includes  some  loop  overhead.  Once  this  has  been  performed,  the  next  entry  call 
is  made,  and  this  process  continues. 

From  Table  6.2  we  can  conclude  that  321  instructions  are  required  to  ini- 
tialized the  first  task  using  the  Ada  PLE.  Combined  with  the  975  instructions 
that  are  executed  by  the  parent  task  to  initiate  this  task,  we  see  that  1,296 
instructions  are  executed  before  the  first  iteration  commences. 

We  define  INTLpu{N)  to  be  the  number  of  instructions  required  to  initiaJize 
N  iterations  using  the  Ada  PLE.  Note  that  INTLpif  does  not  include  those 
instructions  required  to  initiate  these  A^  tasks. 

Table  6.2  allows  us  to  compute  when  the  second  and  subsequent  children 
tasks  begin  executing  their  iterations.  As  task  initialization  is  performed  se- 
quentially, the  :""  child  task  niu.st  wait  for  i  -  1  children  tasks  to  acquire  their 
identity  before  it  can  receive  its  own.  In  particular,  the  i  -|-  l'**  child  task  begins 
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executing  its  iteration  426  instructions  after  the  i""  task  has  begun  its  iteration, 
i.e. 

INTLpuii  +  1)  =  INTLpuii)  +  426. 

Therefore,  the  total  overhead  of  the  Ada  PLE  is  Hnearly  dependent  on  N  and 
is  given  by 

Opu  =  426  •  TV  +  870. 

We  shall  see  in  Section  6.5  that  this  fact  renders  the  Ada  PLE  inappropriate 
for  expressing  loop-level  parallehsm. 

6.3      Tree-Based  Initialization 

As  the  last  section  illustrates,  distributing  identities  sequenticJly  significantly 
increases  overhead.  An  alternative  method  of  task  initialization  is  to  have  each 
child  task  distribute  identities  to  other  tasks  in  a  tree-like  manner.  We  refer  to 
this  method  as  tree-based  initialization. 

As  shown  in  Figure  6.3,  this  scheme  replaces  the  N  entry  calls  formerly  in  the 
parent  task  with  one  entry  call  to  the  first  task  of  the  parJoop  array.  However, 
each  task,  T,  of  the  airray  is  now  responsible  for  distributing  iterates  to  two 
children  tasks. ^  We  refer  to  these  children  as  the  left  ajid  right  child  of  T.  To 
determine  these  tasks  a  "heapsort^-hke  method  is  used;  a  task  with  iterate  i 
has  descendants  with  iterates  2i  and  2i  -f- 1.  As  this  method  has  greater  built-in 
parallelism,  we  can  expect  a  logarithmic  overhead,  instead  of  a  linear  one. 


'Fiom  here  on  "children"  is  meant  in  the  sense  of  the  initialization  tree. 
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declare 

task  type  iteration_task  is 

entry  get.iterate  (iterate  :  in  integer); 
end  iteration  .task; 

parJoop  :  array  (l-.N)  of  iteration-task; 

task  body  iteration_task  is 

myJterate,  left.child  :  integer; 
begin 

accejjt  get_iterate(iterate  :  in  integer)  do 
myJterate  :=  iterate; 

end  get-iterate; 

—  Distribute  iterates  to  two  potential  children. 

left.child  :=  2  *  myJterate; 
if  left.child  <  l\l  then 

par-loop  (left-child). get.iterate  (left-child); 

if  left-child  +  1  <  N  then 

par  Joop  (left.child  +  l).getJterate  (left-child  +  1); 

end  if; 
end  if; 

end  iteration.task; 

begin  —  Mciin  only  initializes  the  first  task. 

par  Joop  (1 ). get  Jterate(l); 
end; 

Figure  6.3:  Tree-Based  Initialization  Variant  of  the  Ada  PLE 
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6.3.1      Initialization  Overhead 

We  now  focus  our  attention  on  the  overhead  required  to  initialize  N  tasks  using 
this  method.  After  all  tasks  have  been  initiated,  the  parent  task  calls  the  entry 
of  the  first  task.  Since  a  for  loop  is  no  longer  needed,  the  category  label  pre- 
entry  call  in  Table  6.2  is  reduced.  However,  the  instructions  required  to  perform 
the  entry  call  and  rendezvous  remain  the  same.  A  total  of  206  instructions  are 
required  for  the  first  task  to  receive  its  iterate,  and  to  reach  the  point  immedi- 
ately following  the  accept  body.  At  this  point  the  parent  task  no  longer  plays  a 
role  in  distributing  iterates. 

Since  tasks  in  the  array  are  activated  in  parallel,  we  assume  they  will  reach 
their  accept  statement  before  the  parent  task  reaches  its  entry  call.  The  in- 
structions associated  with  these  pre-initialization  actions  are  included  in  the 
instructions  required  for  the  parent  task  to  distribute  an  iterate  to  the  first 
task.  After  a  child  task  receives  its  iterate,  it  attempts  to  call  the  entries  of 
its  two  children.  In  Table  6.3,  we  categorize  the  instructions  executed  to  carry 
out  the  associated  rendezvous  into  three  phases:  pre-entry  call,  entry  call,  and 
rendezvous.  As  we  can  see,  326  and  327  instructions  are  required  to  initialize 
the  left  and  right  child,  respectively.  As  right  child  initialization  occurs  after  left 
child  initialization,  a  total  of  653  instructions  are  required  to  initialize  a  right 
child. 

We  now  determine  the  total  amount  of  instructions  required  to  initialize  N 
tasks  using  the  tree-based  approach.  Consider  a  child  task,  z,  with  two  children. 
We  define  INTLtrte{i)  to  be  the  number  of  instructions  required  for  task  i  to 
become  initialized  using  tree-based  initialization.    From  Table  6.3  we  see  that 
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Child 

Pre-Entry 
Call 

Entry 
Call 

Rendezvous 

Total 

Left 

170 

59 

97 

326 

Right 

171 

59 

97 

327 

Post-Children 

- 

- 

- 

6 

Total  Per  Iterate 

- 

- 

- 

659 

Table  6.3:  Per  Task  Initialization  Overhead  for  Tree-Based  Version 
task  2i  becomes  initialized  326  instructions  after  INTLtTee{i)i  i-e. 

INTLtree{2i)   =   I  NT  Ltree{i)  +  ^26. 

Likewise,  task  2i  -(-  1  is  initialized  327  instructions  after  INTLtree{2i)  or  653 
instructions  after  INTLtree{i)-  Note  that  while  task  i  is  performing  the  initial- 
ization of  task  2i  -I-  1,  task  2i  can  begin  the  initialization  of  its  children  tasks. 

Complete  Trees 

To  analyze  the  total  overhead  in  performing  these  initiahzations  we  first  consider 
the  case  when  the  initialization  of  A'  forms  a  complete  tree.  Thus,  let 

N  =  2    —  Ijfor  some  integer  fc  >  1. 

Consider  any  level  of  the  tree.  As  the  right  child  of  a  task  always  get  initicJized 
after  the  left  child,  it  follows  that  the  last  task  to  be  initiahzed  in  any  complete 
level  of  the  tree  is  the  rightmost  task.  Since  a  task  cannot  be  initiahzed  until 
its  parent  is  initialized,  the  last  task  to  be  initialized  in  a  complete  tree  of  size 
A^  is  the  rightmost  leaf  task  of  the  tree,  i.e.  tcisk  A^. 

To  determine  when  tcisk  N  is  initialized,  we  consider  what  must  precede  it. 
\Vc  have  already  seen  that  206  instructions  are  required  for  the  parent  task  to 
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initialize  the  root,  or  first  task  of  the  tree.  Next,  each  of  the  ancestors  of  the  N 
must  initialize  their  children.  Since  N  forms  a  tree,  there  are  [log  N\  ancestors  of 
N.  Each  ancestor  must  initiahze  two  children  which  comprises  653  instructions.^ 
Therefore,  an  additional  653  ■  [log  A^J  instructions  must  be  executed  before  tzisk 
N  is  initialized. 

Once  task  A^  has  been  initialized,  it  attempts  to  initialize  potential  chil- 
dren tasks.  Although  it  has  no  children  tasks  to  initialize,  it  nevertheless  has 
to  execute  57  instructions  to  determine  this  fact.  Thus,  the  total  number  of 
instructions  required  to  initiahze  N  tasks,  where  N  forms  a  complete  tree  is: 

INTLtree{N)    =    206  +  653  •  [log  iVJ  +  57 

=    653-  [logiVj  +263  (1) 

Incomplete  Trees 

We  now  compute  INTLtree{N),  where  the  initialization  of  these  N  tasks  does 
not  form  a  complete  tree.  Therefore,  we  assume 

iV  /  2*  -  l,for  all  integers  k>l, 

and  let  CTn  be  the  task  number  that  corresponds  to  the  rightmost  node  on  the 
last  complete  level  of  the  initicilization  tree  for  N  tasks,  i.e.  CTpf  =  2'-'°^^-'  —  1. 
Since  the  initiahzation  of  N  tasks  must  include  the  initialization  of  the  task 
numbered  CT^,  a  lower  bound  for  INTLtree{N)  exists: 

INTLtr..{CTs)    <    INTLtre.{N)  (2) 


^We  do  not  include  the  last  6  instiuctions  associated  with  the  initialization  of  a  child  task 
because  the  children  do  not  need  to  wait  for  these  instructions  to  be  executed  before  they  are 
initialized. 
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The  true  value  of  INTLtrtt{N)  is  a  function  of  how  much  of  the  leaf  level  is 
complete.  Instead  of  determining  this  value  we  derive  an  upper  bound  for  it, 
and  show  that  this  bound  is  quite  close  to  the  lower  bound  for  INTLtret{N). 

As  all  tasks  numbered  less  than  CTtv  are  initialized  when  we  initialize  CTff, 
any  task  that  would  make  INTLtret{N)  greater  than  INTLirec{CTj^)  must  be 
numbered  greater  than  CT^r,  i.e.  it  must  be  on  the  incomplete  leaf  level.  Con- 
sider one  of  these  leaf  tasks,  T.  As  we  have  seen,  the  time  at  which  T  becomes 
initialized  is  a  function  of  when  its  ancestors  become  initialized.  Left  children 
become  initialized  after  their  parents  are  initialized.  However,  as  right  children 
must  wciit  for  both  their  parent  to  become  initialized  and  for  the  parent  to  ini- 
tiahze  its  left  child,  right  children  ancestors  of  T  must  wait  the  longest.  Since 
task  CTjs/  has  to  wait  for  [logTVJ  left  and  right  children  initializations,  any  leaf 
task  that  would  make  the  value  of  INTLtret{N)  greater  than  INTLtr^i^TN) 
must  have  to  wait  for  more  than  [log  CTn\  left  and  right  children  initializations. 

As  task  T  is  at  level  [logTJ,  it  must  wait  for  [logTJ  left  initiahzations  or 
326-  [logTJ  instructions.  Since  [logTJ  =  [log  CTn\  -f  1,  at  most  326  additional 
instructions  can  be  attributed  to  left  initializations.  As  right  initicdizations 
require  one  more  instruction  than  left  initializations,  in  order  for  INTLtrte{T)  to 
be  greater  than  INTLtree{CTN),  at  least  [logCT^J  right  children  initiahzations 
must  occur  via  the  ancestors  of  T. 

Only  one  task  at  level  [logTJ  requires  [logTJ  right  initializations:  the  right- 
most task  at  level  [logTJ.  As  the  initicdization  of  A^  tasks  docs  not  form  a  com- 
plete tree,  [logTJ  right  initizdizations  cannot  take  place.  Therefore,  the  most 
delay  task  T  can  exhibit  is  by  waiting  for  [logNJ  (=  [logTJ)  left  initiahzations 
and  [logTVJ  —  1  right  initializations.  Since  a  left  and  right  initiahzation  require 
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326  and  327  instructions,  respectively,  we  have  the  following  upper  bound  for 
the  number  of  instructions  needed  to  initialize  N  tasks  in  an  incomplete  tree: 

INTLt„,{N)    <    206-f  326-[logiVJ  +327-([logA^J  -l)  +  57         (3) 
=    653-  [logiVJ  -64  (4) 

The  first  term  in  (3)  corresponds  to  the  parent  task  initializing  task  number  1. 
The  second  and  third  terms  represent  the  most  left  and  right  child  initializa- 
tions that  can  occur,  respectively.  The  last  term  corresponds  to  the  number  of 
instructions  that  a  leaf  node  executes  to  determine  that  it  has  no  children  to 
initialize. 

Combining  (2)  and  (4),  we  have  the  following  inequality: 

653  •  [logA^J  -  384  <  INTLtr»{N)  <  653  •  [logiVJ  -  64 

As  expected,  the  initiahzation  overhead  for  the  tree-base  PLE  is  logarithmic  in 
N .  As  the  lower  and  upper  bounds  for  INTLtree  are  relatively  close,  we  choose 
to  use  the  lower  bound  in  our  comparisons  of  Section  6.5. 

6.4     The  GenJVLinitask  Idiom  {GMI) 

In  this  section  we  describe  the  overhead  associated  with  both  versions  of  the 
GMI.  As  expected,  this  overhead  is  substantially  less  than  the  overhead  asso- 
ciated with  the  Ada  PLE  as  described  in  the  previous  sections.  The  reduction 
in  overhead  is  due  to  a  number  of  factors: 

•  Our  implementation  does  not  need  to  elaborate  the  types  eind  objects 
required  for  the  Ada  PLE. 
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•  Our  implementation  combines  task  creation  with  task  activation  and  per- 
forms these  combined  operations  in  parallel  with  other  creation/activation 
operations. 

•  Our  implementation  reduces  the  number  of  instructions  needed  to  create 
each  iteration  task  by  reducing  the  status  required  for  each  of  these  tzisks. 

•  Our  implementation  distributes  unique  iterates  to  each  task  in  parallel. 

Although  all  four  of  these  factors  contribute  to  the  reduction  in  overhead,  the 
last  factor  plays  the  most  significant  role. 

In  the  next  two  sections  we  elaborate  on  these  factors  by  providing  the  num- 
ber of  instructions  that  are  required  to  perform  minitask  initiation  and  initializa- 
tion. This  is  followed  by  Section  6.5  where  we  compare  these  numbers  against 
the  corresponding  values  for  both  versions  of  the  Ada  PLE. 

6.4.1      Minitask  Initiation 

In  this  section  we  describe  the  amount  of  overhead  that  is  required  to  initiate  a 
family  of  minitasks.  The  figures  we  present  include  the  elaboration  of  the  sub- 
type and  procedure  body,  in  addition  to  performing  instantiation  of  the  generic 
package.  We  divide  this  initiation  into  the  following  three  categories: 

Pre-Minitask  Creation:   Actions  that  are  performed  before  the  creation  of  any 
minitasks.  These  correspond  to: 

•  Elaborating  the  subtype, 

•  Elaborating  the  loop  body  procedure,  and 
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•  Passing  these  objects  as  parameters  to  the  generic  instantiation. 

Minitask  Creation/Activation:  This  category  represents  all  actions  involved 
with  the  creation  and  activation  of  minitasks.  Since  minitask  creation  and 
activation  are  combined,  this  category  includes  actions  that  are  the  se- 
mantic equivalent  to  task  creation,  post  task  creation,  and  task  activation 
in  the  Ada  PLE.  It  includes  the  foDowing  actions: 

•  The  parent  task  places  the  appropriate  number  of  minitask  items  on 
to  the  parallel  queue. 

•  Each  of  these  items  is  removed  by  a  processor  and  a  minitask  is  cre- 
ated. 

•  The  declarative  part  of  each  minitask  is  elaborated. 
Post  Execution:  This  category  involves: 

•  reclaiming  storage  for  each  of  the  minitasks, 

•  exiting  the  package's  stack  frames, 

•  unblocking  the  parent  task,  and 

•  the  parent  task  exiting  the  stack  frame  of  the  block  statement. 

Table  6.4  specifies  the  number  of  the  instructions  that  are  performed  to 
complete  each  of  these  categories.  Although  these  initiation  figures  offer  a  better 
than  50%  improvement  over  the  corresponding  figures  for  the  Ada  PLE,  the 
actual  performance  may  be  even  better: 
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Action 
Performed 

Number  of 
Instructions 

Percentage 
OF  Total 

Pre-Minitask  Creation 

149 

36% 

Minitask  Creation/Activation 

198 

48% 

Post  Execution 

66 

16% 

Total 

413 

100% 

Table  6.4:  Initiation  Overhead  for  a  Family  of  Minitasks 

Recall  that  in  order  to  initiate  tasks  in  the  Ada  PLE  a  synchronization  point 
exists  between  task  creation  and  task  activation.  To  implement  this  synchro- 
nization point,  a  "join"  cind  subsequent  "fork"  must  take  place.  If  a  processor 
executing  a  thread  is  delayed  before  the  join  occurs,  the  fork,  and  subsequently 
all  threads  will  be  delayed.  Although  our  analysis  accounts  for  the  instructions 
that  are  executed  to  perform  the  join  and  fork,  it  does  not  account  for  the 
possible  delays  in  performing  this  join. 

In  the  GMI  task  creation  and  activation  are  combined;  no  synchronization 
point  exists.  Therefore,  delays  of  this  type  only  affect  the  task  executing  on  the 
processor  that  exhibits  the  delay,  not  all  tasks. 


6.4.2      Minitask  Initialization 

In  this  section  we  analyze  the  number  of  instructions  each  minitask  executes  to 
receive  its  unique  identity  or  iterate.  This  initialization  must  occur  before  a 
minitask  can  begin  to  execute  the  sequence  of  statements  of  the  procedure  that 
corresponds  to  the  body  of  the  parallel  loop.  The  overhead  required  for  our 
idiom  is  substantially  less  than  the  overhead  associated  with  the  initiaJization 
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task  body  minitask  is 
iterate  :  iteration_range; 
procedure  record-exception  (exception.raised  :  exception-type)  is 

end  record_exception; 

begin 

iterate  :=  iteration-range(iteration.counter.faa  (1)); 
loopJ)ody  (iterate); 

exception 

end  minitask; 

Figure  6.4:  The  Body  of  a  Minitask 

phase  of  both  versions  of  the  Ada  PLE. 

Consider  the  body  of  the  minitask  as  shown  in  Figure  6.4.  Once  a  minitask  is 
activated,  it  performs  two  operations  before  the  sequence  of  statements  of  the 
parcdlel  loop  are  executed;  it  calls  the  faa  function  to  obtciin  its  iterate,  and  then 
it  calls  the  procedure,  loop-body,  to  execute  the  corresponding  iteration. 

The  faa  function  ceJl  is  implemented  in  a  highly  optimized  manner;  multiple 
faa  requests  are  performed  in  parallel.  Moreover,  as  described  in  Chapter  4,  we 
acqmre  the  value  of  the  iterate  for  free;  we  do  not  need  to  execute  any  additional 
instructions  to  obtain  this  value.  As  we  have  seen  in  Sections  6.2.2  and  6.3.1, 
this  is  a  substantial  savings  over  the  corresponding  method  used  to  distribute 
iterates  to  each  task  of  the  Ada  PLE. 

We  implement  the  procedure  call  to  loop_body  in  the  usual  manner.  At  first 
glance,  one  may  wish  to  remove  the  procedure  call  by  inserting  the  code  of  the 
procedure  into  the  minitask  body  itself.  However,  recall  that  by  allowing  the 
user  of  our  idiom  to  specify  the  body  of  the  loop  using  a  procedure,  we  provide 
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Action 
Performed 

Number  of 
Instructions 

faa  Function  Call 

0 

loop.body  Function  Call 

38 

Table  6.5:  Initialization  Overhead  for  each  Minitask 

increased  functionality  not  normally  associated  with  parallel  loops.  Namely,  the 
ability  to  create  local  objects,  be  it  variables,  subprograms,  or  tasks.  For  this 
reason,  we  cannot  "inline"  the  loop.body  procedure.  In  Table  6.5,  we  specify  the 
number  of  instructions  required  to  implement  the  faa  function  call  and  the  call 
to  the  loop  body  procedure. 

We  now  consider  the  combined  overhead  required  before  a  minitask  begins 
its  execution.  From  Table  6.4,  we  see  that  413  instructions  are  required  for 
minitask  initiation.  Once  a  minitask  has  been  initiated,  it  executes  an  additional 
38  instructions  during  its  initialization.  Therefore  a  total  of  451  instructions  are 
executed  before  each  minitask  begins  its  execution. 

6.5      Comparing  Parallel  Loop  Idioms 


In  this  section  we  compare  the  overhead  requirements  of  the  gen_minitask  idiom 
with  those  of  both  versions  of  the  Ada  PLE.  As  expected  our  idiom  compares 
favorably.  We  discuss  issues  of  granularity  and  efficiency  for  each  of  the  parallel 
loops  idioms. 

Table  6.6  specifies  the  overhead  requirements  for  each  form  of  parallel  loop. 
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Action 

Ada  PLE 

Tree-Based  Ada  PLE 

GMI 

Initiation 

975 

975 

413 

Initicdization 

426 -TV -105 

653-  [logN]  -384 

38 

Total 

426  •  iV  +  870 

653-  [logiVJ  +591 

451 

Table  6.6:  Overhead  Requirement  for  Each  Parallel  Loop  Idiom 

It  shows  that  task  initiation  is  reduced  by  over  50%  using  the  GMI.  Further- 
more, a  more  significant  savings  is  found  in  the  initialization  process.  The  GMI 
requires  a  constant  number  of  instructions  regardless  of  the  number  of  iterations 
to  be  executed,  where  as  both  versions  of  the  Ada  PLE  require  a  number  of  in- 
structions dependent  on  the  number  of  iterations  of  the  parallel  loop.  Although 
the  tree-based  version  of  the  Ada  PLE  replaces  a  hnear  term  with  a  logarithmic 
one,  it  is  nevertheless  still  dependent  on  N .  This  fact  combined  with  the  high 
initiation  requirements  required  for  both  versions  of  the  Ada  PLE  renders  them 
inappropriate  in  obtaining  efficient  loop-level  parallehsm. 


6.5.1      Granularity 

Once  the  overhead  for  a  parallel  loop  construct  has  been  determined,  its  granu- 
larity can  be  computed.  The  previous  section  identified  the  amount  of  overhead 
required  for  the  GMI  and  the  two  versions  of  the  Ada  PLE.  In  this  section  we 
compute  the  granularity  for  all  three  versions  of  parallel  loops. 

Before  we  discuss  granularity,  we  define  two  terms,  SEQ  and  PARc.  The 
former  corresponds  to  the  number  of  instructions  required  to  execute  a  loop 
sequentially,  while  the  latter  represents  the  number  of  instructions  needed  to 
execute  the  loop  in  parallel  using  parallel  loop  construct  c. 
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Definition  1  Let  A^  be  the  number  of  iterations  of  a  loop,  S  be  the  average 
number  of  instructions  executed  by  each  iteration,  and  0„,  be  the  iteration 
overhead  associated  with  a  sequential  loop.  The  number  of  instructions  required 
to  execute  the  loop  sequentially  is  given  by 

SEQ  =  N  -{5  +  0„,).  (5) 

Since  two  C  instructions  are  needed  to  increment  and  test  the  index  variable 

of  loop,  0,eq   =  2. 

Definition  2  Let  Oc  represent  the  overhead  of  a  parallel  loop  construct  c  and  S 
be  the  average  number  of  instructions  executed  by  each  iteration.  The  number 
of  instructions  required  to  execute  the  loop  in  parallel  using  c  is 

PAR,  ^Oc  +  S.  (6) 

In  order  to  determine  the  granularity  of  a  particular  parallel  construct  we 
first  determine  the  point  at  which  the  parallel  construct  reduces  the  execution 
time  as  compared  with  the  sequential  execution,  i.e.  at  what  point  is  it  beneficiaJ 
to  utihze  this  construct.  We  define  this  point  formcilly. 

Definition  3  Let  c  be  a  parallel  loop  construct.  We  define  LBGc,  Lower  Bound 
Granularity  of  c,  to  be  the  minimum  average  number  of  instructions  executed  by 
each  iteration  such  that  the  number  of  instructions  executed  using  c  (a  parallel 
construct)  is  equal  to  the  number  of  instructions  executed  by  the  sequential 
version  of  the  loop. 

By  computing  the  LBGe  for  a  loop  construct  c,  we  obtain  a  lower  bound  for 
this  construct;  any  use  of  c  with  greater  than  this  number  of  instructions  induces 
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the  running  time  of  the  program.  To  compute  LBGc  we  need  to  find  the  value 
of  5  such  that 

SEQ  =  PARc. 

Substituting  (5)  and  (6),  and  solving  for  S  we  get  the  following: 

Oc-2N 

^^^^    ^        N-l  ^'^ 

As  the  unit  of  measure  for  LBGc  is  the  average  number  of  instructions  in  an 
iteration,  it  is  bounded  below  by  0.  Thus,  the  formal  definition  of  LBGc  is: 


LBGc 


%f^    ifO,  >2-A^ 

(8) 

0  otherwise 


Equation  8  provides  a  lower  bound  grain  size  for  any  parallel  loop  construct. 
In  particular,  it  can  be  applied  to  our  three  parallel  loop  constructs.  In  this 
way  we  can  determine  the  minimum  number  of  instructions  each  must  execute 
before  any  savings  due  to  parallelism  is  reaUzed. 

Computing  LBGpu 

First  we  consider  the  Ada  PLE  construct  with  serial  initialization.  Recall  from 
Table  6.6  that  Opu  =  i26  ■  N  +  870.  Therefore, 

LBG,.    =    l?i^^^  (9) 

For  large  values  of  A'^  we  find: 


ta  iai_^I±l!0    =    424. 
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Regardless  of  the  level  of  paraJlelism  of  our  machine,  the  average  size  of  each 
iteration  must  be  greater  than  424  instructions  in  order  for  the  execution  time 
to  be  reduced.  As  this  number  is  rather  large,  we  can  see  why  the  Ada  task  is 
considered  to  be  a  coarse  grain  construct  not  amenable  to  loop-level  paraJlelism. 
This  is  a  direct  consequence  of  the  serial  bottleneck  that  is  present  in  distributing 
iterates  to  each  task. 

Computing  LBGtree 

We  next  compute  LBGtrce,  the  lower  bound  granularity  for  the  Ada  PLE  using 
the  tree-based  initieJization.  As  in  the  case  of  the  serial  Ada  PLE  we  use  (8) 
to  compute  this  value.  Since  Otrec  =  653  •  [logiVJ  +  591,  for  values  of  N  where 

653  •  [log  iVJ  +  591  >  2  •  A'  we  have 


LBGtree       =       * 


653|logNj4-591-2.N     if  653  •  I  log  iVJ  +  591  >  2  ■  A^ 
0  otherwise 


Forsufficientlylarge  Ar,653-[logArj-|-591  <  2- A^,  thereby  making  IBGtr«  = 
0.  This  fact  confirms  our  intuition  that  as  A^  becomes  large,  the  tree-based 
version  of  initicJizing  the  Ada  PLE  is  more  efficient  than  its  seried  counterpart; 
the  required  granularity  per  processor  is  zero.  However,  we  shall  see  that  it  still 
does  not  provide  a  means  of  realizing  efficient  loop-level  parallelism. 
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Computing  LBGgm 

Lastly,  we  compute  LBGgm,  the  lower  bound  granularity  for  the  GMI.   Since 
Ogr,,  =  451,  we  have 

'  ^^^xfr^    if  A'  <  225 

(11) 

0  if  iV  >  226 

The  value  of  LBGgm  is  also  equal  to  zero  as  N  gets  larger.  This  is  as  expected; 
as  the  amount  of  parallelism  increases,  the  required  grain  size  that  is  needed  to 
offset  the  overhead  decreases.  Although  the  limits  of  LBGgm  and  LBGtref  are 
equal,  the  GMI  out  performs  the  tree-based  version  of  the  Ada  PLE. 
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Figure  6.5:  LBGj,u{--),  I5G«.„(  — ),  and  LBGgm{—) 


In  Figure  6.5.  we  plot  the  v-alues  oiLBGpi,.  LBGtret  and  LBGgm-  Each  curve 


CHAPTER  G.    PERFORMANCE  ANALYSIS  157 

illustrates  for  a  given  number  of  iterations  the  number  of  instructions  that  axe 
required  for  parallel  execution  to  be  beneficial.  All  coordinates  that  lie  below  the 
curve  correspond  to  Vcdues  of  the  iteration  size  and  number  of  iterations  where 
the  parallel  construct  is  not  beneficial.  Likewise,  all  coordinates  above  the  curve 
show  \-alues  where  the  pcirallel  construct  out  performs  sequential  execution. 

This  figure  illustrates  that  the  gen-minitask  idiom  is  the  most  general  in 
that  it  requires  both  a  snicJler  number  of  iterations  and  iteration  size  than 
the  other  two  idioms.  For  example,  when  A'  =  100.  the  Ada  PLE  and  its 
tree-based  version  require  an  average  iteration  size  of  greater  than  437  and  43 
instructions,  respectively,  to  justify  their  use.  while  the  GMI  only  requires  am 
average  iteration  size  of  2.5  instructions.  While  it  is  certainly  possible  for  a 
parallel  loop  to  contain  an  average  iteration  size  of  43  instructions,  this  threshold 
is  rather  high  considering  this  is  the  first  point  at  which  an  increzised  performance 
is  obtained. 

While  this  metric  shows  that  the  GMI  is  much  more  efficient  them  the  two 
versions  of  the  Ada  PLE.  it  more  importantly  illustrates  that  the  granularity 
of  our  idiom  is  small;  for  all  but  the  smallest  loops,  our  idiom  is  superior  to  a 
sequential  execution. 

6.5.2      Efficiency 

The  previous  section  describes  the  lower  bound  granularity  required  to  offset 
the  overhead  sissociated  with  a  pcu-allel  loop  construct.  This  metric  is  useful 
in  determining  when  it  is  beneficial  to  use  a  parallel  loop  construct.  However, 
most  ajjplications  that  employ  parallelism  do  so  with  the  expectation  of  not  only 
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reducing  the  running  time  of  their  program,  but  reducing  it  in  proportion  to  the 
amount  of  parallchsm  that  is  present.  In  this  section  we  exaxnine  each  parallel 
loop  construct  in  this  light;  we  determine  how  efficient  they  are  in  achieving 
their  optimal  performance. 

Before  we  discuss  efficiency  we  define  the  speedup  of  parallel  loop  construct 
c,  SPc,  in  the  usual  manner.  That  is,  it  is  the  ratio  of  the  number  of  executed 
instructions  of  the  sequential  version  to  the  number  of  executed  instructions  of 
the  parallel  version,  i.e. 

^^'      PAR,-  ^     ' 

Speedup  is  useful  in  meaisuring  how  much  faster  a  parallel  construct  is  com- 
l)ar(>d  to  the  sequential  version.  A  speedup  greater  than  one  signifies  that  the 
parallel  version  is  faster  thcin  the  sequentieil  version,  a  desirable  result.  How- 
ever, this  metric  does  not  convey  how  much  parallelism  is  required  to  achieve 
this  value.  An  application  with  a  speedup  of  five  using  five  tasks  should  be 
distinguished  from  an  appHcation  with  a  speedup  of  five  using  ten  tasks.  As  the 
second  application  needs  five  more  tasks  to  achieve  the  seime  speedup,  it  is  not 
as  efficient.'' 

To  capture  how  efficient  an  application  is  in  utihzing  its  resources,  the  metric, 
efficiency  is  defined.  E,  is  the  ratio  of  the  speedup  for  paxEdlel  loop  construct  c 
over  the  number  of  tasks  that  are  used,  i.e. 

SPc 


E.    = 


SEQ 
_       PAR, 

N 


'In  this  discussion  we  assume  all  tasks  are  being  used  productively. 
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N-(5-f2) 
_         Oc  +  S 

N 

(13, 


0,-^S 

As  Oc  is  usually  greater  than  2,  efficiency  ranges  from  0  to  1.    Solving  (13) 
for  S,  we  get: 

5  +  2 


Oc  +  S 


=    E. 


E     O  -'> 

Efficiency  of  the  Ada  PLE 

Equation  13  specifies  that  the  efficiency  of  a  parallel  loop  construct  is  a  function 
of  its  overhead  Eind  the  size  of  each  iteration.  Applying  this  equation  to  the  Ada 
PLE.  we  get: 

5-1-2 
^"'^    "    42G  •  .V  +  870  +  5  ^^^^ 

The  efficiency  of  the  Ada  PLE  is  a  function  of  the  niimber  of  threads  and  the  size 
of  each  thread.  Since  the  denominator  of  (15)  contedns  a  rather  large  constant 
factor  (870)  and  also  a  large  factor  of  N  (42C),  it  takes  a  significant  thread  size 
to  get  the  efficiency  close  to  one. 

To  get  a  feel  for  the  significance  of  this  denominator,  we  use  (14)  to  compute 
the  values  of  5  for  particular  \-cdues  of  A'  that  eire  needed  to  achieve  an  efficiency 
of  25'/?  (see  Figure  C.C).  In  order  to  obtain  this  level  of  efficiency,  the  size  of 
each  thread  must  increases  rapidly  as  A'  gets  larger.  Once  again  this  is  a  result 
of  the  serial  bottleneck  that  is  required  in  distributing  iterates  to  each  task. 
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Figure  0.6:  25'X  Efficiency  for  Epie{-  ■  •),  Etrtt{ ),  and  Egmi — ) 

Efficiency  of  the  Ada  PLE  With  Tree-Based  InitiaHzation 

Ai)i)lyiiig  (13)  to  the  tree-base  version  of  the  Ada  PLE,  we  get: 

5  +  2 
^'"'    "    653-  [log  A- J  +591  +  S  ^"^^^ 

As  is  the  case  with  the  serial  version  of  the  Ada  PLE,  the  efficiency  of  the 
tree-ba-sed  version  is  a  function  of  the  number  of  threads  and  the  size  of  each 
thread.  Note  that  a  linear  term  has  been  replace  by  a  logaritlmiic  one. 

Once  again  we  use  (14)  to  compute  the  values  of  S  that  is  needed  to  obtain 
an  efficiency  rate  of  25%  (Figure  6.6).  We  compare  these  %'alues  with  those  of 
the  serial  version  of  the  Ada  PLE.  As  expected  the  average  thread  size  required 
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for  the  tree-base  PLE  out  performs  the  serial  version. 

Efficiency  of  the  GMI 

hi  this  section  we  apply  (13)  to  the  gen.minitask  construct  and  get: 

Comparing  Eg^  with  Epu  and  Etrte,  we  see  two  major  differences: 

1.  Egm  does  not  depend  on  N. 

2.  Eg,„  has  a  lower  constemt  factor  associated  with  its  overhead. 

These  two  differences  substantially  reduce  the  size  of  each  thread  that  is  needed 
to  obtain  a  particular  level  of  efficiency.  Since  Egm  does  not  depend  on  A',  we 
use  (14)  to  solve  for  S. 

We  use  ( IS )  with  Egm  =  25%  to  compute  the  required  \-alue  of  5.  We  contrast 
this  value  which  is  constant  for  all  \-alues  of  A'  with  the  values  of  the  other  two 
idioms  (Figure  6.C). 

In  Table  C.7  we  use  (18)  to  specify  the  size  of  each  thread  that  is  needed 
to  achieve  various  levels  of  efficiency.  From  this  table  we  see  that  obtaining  an 
efficiency  of  SO'X  with  the  gen.minitask  requires  451  instructions.  Although  this 
value  may  seem  rather  large,  the  GMI  is  able  to  achieve  reeisonable  levels  of 
efficiency  with  small  thread  sizes,  something  both  versions  of  the  Ada  PLE  are 
unable  to  do. 
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Efficiency 

Average  Thread  Size 

Ada  PLE 

Tree-Based  PLE 

Gen_Minitask 

5% 

22.4  •  N  +  43.7 

34.4-  [logA^J  +29 

21.6 

10% 

47.3- A^  + 94.4 

72.6-  [logiVJ  +63.4 

47.9 

20% 

106.5  •iV  + 213 

163.3-  [logA^J  +143.3 

110.3 

30% 

182.6  ■  N  +  370 

279.9-  [logA^J  +250.4 

190.4 

40% 

284 -TV +  576.7 

435.3-  [logA^J  +390.7 

297.3 

50% 

426  •  iV  +  866 

653-  [logA^J  +587 

447.0 

Table  6.7:  Average  Thread  Size  Required  for  Various  Efficiency  Levels 
A  Comparison  with  IBM  VS  Fortran 

Another  way  of  evaluating  the  efficiency  of  our  idiom  is  to  compare  its  overhead 
with  the  overhead  of  a  parallel  loop  from  a  different  language.  We  have  see 
that  our  idiom  requires  451  C  instructions  before  all  minitasks  begin  executing. 
As  a  comparison  we  give  the  corresponding  figures  for  VS  FORTRAN  [IBM89]  as 
estimated  by  Cytron  [Cyt9l]. 

The  Parcdlel  Loop  construct  in  VS  FORTRAN  executed  on  a  3090  requires: 

•  58  machine  instructions  to  initiate  a  parcdlel  loop, 

•  290  machine  instructions  to  initialize  each  iteration  thread, 

•  5  machine  instructions  (performed  in  parallel)  for  each  thread  commence, 

•  58  machine  instructions  to  terminate  the  loop  construct. 


Thus,  121  +  290  *  N  instructions  are  required  to  execute  a  parallel  loop  in  VS 
FORTRAN  on  a  3090,  where  N  is  the  number  of  processors  assigned  to  execute 
the  loop.  Note  that  the  number  of  processors  assigned  to  a  loop  isn't  necessarily 
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the  same  as  the  number  of  iterations.  This  implies  a  form  of  chunking  is  present. 
Chunking  was  discussed  in  Chapter  5  and  is  addressed  in  a  different  context  in 
the  next  section. 

Note  that  the  unit  of  measure  for  VS  FORTRAN  is  a  machine  instruction 
rather  than  a  C  instruction  that  was  used  for  the  genjninitask  idiom.  With  this 
in  mind,  it  is  clear  that  the  complete  constant  overhead  for  VS  FORTRAN  is  lower 
than  the  gen_minitask  idiom  (121  machine  instructions  vs.  413  C  instructions). 
However,  it  is  aJso  clear  that  the  cost  for  initializing  each  iteration  is  much  less 
with  the  genjninitask  idiom  than  it  is  with  VS  FORTRAN  (38  C  instructions  vs. 
290  *  N  machine  instructions).  As  the  latter  comparison  is  more  critical,  we 
conclude  that  our  idiom  compares  favorably  with  the  parallel  loop  construct  of 
VS  FORTRAN. 

Further  Enhancements 

In  addition  to  providing  a  useful  metric  for  the  performance  of  a  parallel  con- 
struction, the  efficiency  formula  can  also  be  used  as  a  guide  to  increase  system 
performance.  For  example,  let's  consider  the  equation  for  Egm- 

5-2 
*"     ~     451  +  S 

As  we  can  see,  Egm  is  a  function  of  5",  the  average  size  of  loop  body.  By 
increasing  this  size,  we  can  improve  the  efficiency  of  the  GMI.  Furthermore,  we 
wish  to  increase  5,  without  increasing  the  total  amount  of  computation  to  be 
performed. 

One  way  of  achieving  this  is  to  assign  more  than  one  iteration  to  a  minitask. 
Using  this  scheme  each  minitask  will  execute  an  average  loop  body  of  c-5,  where 
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c  is  the  chunking  factor  (the  number  of  iterations  per  processor).  This  enhance- 
ment increases  the  size  of  the  body  of  the  loop,  without  increasing  the  total 
computation,  and  thereby  increases  the  efficiency  of  our  idiom.  Furthermore, 
the  number  of  minitasks  required  is  reduced  by  c,  freeing  processors  for  other 
uses. 

This  enhancement  can  be  performed  either  by  the  programmer,  sometimes 
called  loop  coalescing,  or  by  the  implementation,  often  referred  to  as  chunking. 
The  matrix  multiplication  example  in  Appendix  A  specifies  an  example  of  loop 
coalescing.  For  more  details  on  chunking,  see  the  work  by  Flynn  and  Flynn 
[FH90]  and  Kruskal  and  Weiss  [KW85]. 

6.6      Other  Types  of  Loops 

Up  to  this  point  we  have  dealt  solely  with  parallel  loops  with  no  synchronization 
requirements.  In  this  section  we  discuss  how  loops  that  require  synchronization 
among  iterations  can  be  handled  by  our  idiom  and  other  methods. 

The  first  loop  in  Figure  6.7  gives  aji  example  of  a  loop  that  cannot  be  executed 
in  parcdlel  without  the  presence  of  synchronization;  values  computed  in  iteration 
i  are  used  in  iteration  i  +  1.  In  order  to  execute  this  loop  in  parallel,  explicit 
synchronization  is  required;  iteration  i  must  signal  iteration  i  +  1  after  it  stores 
the  VcJue  in  A{i).  Likewise,  iteration  i  -f  1  must  wait  for  this  signal  before  it 
may  use  the  value  of  A{i  —  1).  Although  synchronization  for  this  loop  can  be 
specified  in  Ada,  (see  Figure  6.7),  the  resulting  program  is  inefficient.  Since  most 
of  iteration  i  must  be  executed  before  iteration  z  +  1  is  executed,  any  increase  of 
performcince  due  to  paraJleUsm  is  offset  by  the  additional  overhead  required  to 
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for  i  in  1..N  loop 

A(i):=  2  •  A(i-l); 
end  loop; 

task  body  iteration  is 

i  :  integer; 
begin 

i  :=  get-iteration;  —  A  function  that  gets  the  iterate 
if  i  /=  1  then  —  First  iterate  need  not  wait 
accept  signal  do  —  Wait  for  previous  iterate 

null  ; 
end  signal; 
end  if; 

A(i):=  2*  A(i-l); 
if  i  /=  N  then  —  Last  iterate  has  no  one  to  signal 

parJoop(i  +  l). signal;  —  Signal  next  iterate 
end  if; 
end  iteration; 

Figure  6.7:  A  Non-Parallel  Loop  with  Synchronization  Inserted 

perform  this  synchronization.  Therefore,  we  suggest  a  sequential  execution  for 
loops  of  this  type. 

However,  some  seemingly  sequential  loops  can  be  executed  in  parallel  with 
the  help  of  specicd  transformations.  In  the  remainder  of  this  section  we  discuss 
three  types  of  transformations.  The  transformations  we  discuss  have  been  de- 
veloped by  the  fields  of  vectorizing  and  parallelizing  compilers,  compilers  that 
attempt  to  automatically  convert  seriaJ  loops  into  parallel  ones.''  Although  these 
transformation  are  predominately  performed  on  Fortran  programs,  most  apply 
to  Ada  as  well.  As  is  the  case  with  chunking,  these  transformations  can  either  be 
performed  manually  by  the  programmer  or  automatically  by  the  compiler.  The 
latter  approach  requires  incorporating  sophisticated  dependence  analysis  into  a 


^Although    vectorizing    compilers    deal    only    with    loops,     some    paraJleliiing    conipiJeis 
[BCF'*'88,CHH89]  also  focus  on  other  areas  of  the  program  to  find  parallelism. 
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compilation  system.  Since  dependence  analysis  is  beyond  the  scoop  of  this  work 
we  refer  the  interested  reader  to  [Hin88,BC86]  for  a  more  detailed  discussion. 

6.6.1  Scalar  Expansion 

The  most  useful  transformation  that  can  be  performed  to  parallelize  a  loop  is 
scalar  expansion.  Consider  the  two  loops  shown  in  Figure  6.8.  In  the  first 
loop  arrays  A  and  B  are  being  interchanged,  element  by  element.  Since  the 
scalar,  temp,  is  reused  during  each  iteration,  it  appears  that  this  loop  is  not 
parallehzable.  However,  notice  that  no  values  stored  in  temp  during  one  iteration 
flow  to  any  other  iteration.  Therefore,  if  each  iteration  is  allocated  its  own  temp 
variable,  this  loop  can  be  executed  in  parallel.  In  the  second  part  of  Figure  6.8 
we  show  the  resulting  transformation  and  incorporate  it  wdth  the  GMI. 

6.6.2  Loop  Interchange 

Another  useful  transformation  is  loop  interchange.  Recall  from  Chapter  5  that 
when  two  nested  loops,  one  parallel,  one  sequential,  surround  a  section  of  code 
it  is  better  to  execute  the  outer  loop  in  parallel,  and  the  inner  loop  sequentially 
on  a  multiprocessor.^  The  primary  reason  for  this  is  to  reduce  the  number 
of  fork-join  points,  while  also  increasing  the  size  of  the  outer  loop  as  done  in 
chunking. 

Consider  the  nested  loops  in  the  first  part  of  Figure  6.9.  Due  to  the  flow  of 
values  in  the  i  loop,  it  must  be  executed  sequentially.  However,  bls  no  vailues  flow 


^On  a  vector  machine  the  opposite  is  true.  Since  a  parallel  loop  cjin  correspond  to  a  vector 
operation,  we  desire  these  loops  to  enclose  basic  instructions  only;  a  sequential  loop  enclosed  by 
a  parallel  loop  is  not  beneficial  on  a  vector  machine. 
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for  i  in  1..IM  loop 

temp  :=  A(i);  —  Reuse  of  temp  impedes  pajcillelization 

A(i)  :=  B(i); 

B(i)  :=  temp; 
end  loop; 

declare 

subtype  swap.range  is  integer  range  1..N; 
procedure  loop_body  (i  :  swap. range)  is 

temp  :  item;  —  Give  each  iteration  its  own  copy 
begin 

temp  :=  A(i); 

A(i)  :=  B(i); 

B(i)  :=  temp; 
end  loop. body; 

package  swap  is  new  gen.minitask  (swapj-ange,  loopJbody); 
begin 

if  not  swap. success  then 

—  Hjindle  error  condition  appropriately. 
end  if; 
end; 

Figure  6.8:  An  Example  of  Scalar  Expansion 
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for  i  in  1..M  loop 

for  j  ill  1..N  loop 

A(i,j):=  A(i-l.j)»  A{i  +  l.j); 

end  loop; 
end  loop; 

declare 

subtype  j_range  is  integer  range  1..N; 
procedure  j.body  (j  ;  j.range)  is 
begin 

for  i  in  1..M  loop 

A(l,j):=  A(i-l.j)*A(i+l,j); 
end  loop; 
end  loop-body; 

package  j -loop  is  new  gen.minitask  (j.range,  j.body); 
begin 

if  not  j  .loop. success  then 

—  Handle  error  condition  appropriately. 
end  if; 
end; 

Figure  6.9:  An  Example  of  Loop  Interchange 

among  iterations  of  the  j  loop,  it  can  be  executed  in  parallel.  Although  we  could 
execute  the  inner  j  loop  in  parallel  using  the  GMI,  it  will  be  more  beneficial  by 
having  the  i  loop  nested  inside  of  the  j  loop  as  shown  in  the  second  loop  part  of 
Figure  6.9.  Therefore,  we  interchange  the  i  and  j  loop. 


6.6.3      Loop  Alignment 

The  last  transformation  we  consider  deals  with  loops  that  appear  to  exhibit  a 
flow  of  Vcdues  among  iterations  as  shown  in  Figure  6.10.  In  the  first  loop  we  see 
that  the  value  stored  in  A(i)  during  the  f""  iteration  is  used  in  the  next  iteration. 
This  appears  to  render  this  loop  unparalleHzable.  However,  since  this  is  the  only 
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for  I  in  1..N  loop 

A(i)  :=  B(i)  *  C(i);  --  A(i)  is  stored  here. 

D(i)  :=  A(i-l)  /  2;  —    and  used  here  in  next  iteration. 
end  loop; 

declare 

—  Add  an  extra  iteration. 
subtype  i.range  is  integer  range  0..N; 
procedure  i_body(i  :  i-range)  is 
begin 

if  i  >  0  then  —  Must  checlc  the  end  points 

A(i)  :=  B(i)  *  C(i); 
end  if; 
if  i  <  N  then 

D(i  +  1)  :=  A(i)  /  2;  --  Shift  indices  of  D  and  A 
end  if; 
end  loop_body; 

package  iJoop  is  new  gen.minitask  (i_range,  i_body); 
begin 

if  not  i.loop. success  then 

—  Handle  error  condition  appropriately. 
end  if; 
end; 

Figure  6.10:  An  Example  of  Loop  Alignment 

variable  where  information  is  flowing,  we  can  align  the  two  statements  in  this 
loop  so  that  the  store  and  use  are  performed  in  the  same  iteration.  This  is  done 
in  the  second  part  of  Figure  6.10  where  the  resulting  loop  is  executed  in  parallel 
using  the  GMI. 

Although  this  transformation  appears  to  be  very  promising  it  can  only  be 
applied  to  the  special  case  illustrated  by  Figure  6.10,  when  no  other  flow  de- 
pendences are  adversely  affected  by  the  alignment.  For  more  details  about  this 
transformation  see  [ACK87]. 
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Conclusions 


In  an  interview  in  the  1984,  Jean  Ichbiah,  head  of  the  Ada  design  team  stated: 

"The  Ada  language  was  designed  with  three  overriding  concerns: 
a  recognition  of  the  importance  of  program  reliabihty  and  mainte- 
nance, a  concern  for  programming  as  a  human  activity,  and  effi- 
ciency."  [Ich84] 

In  this  work  we  have  addressed  the  last  of  these  concerns.  We  have  shown 
that  although  Ada's  tasking  model  provides  a  robust  method  for  specifying 
interprocess  communication,  its  semantics  hinder  the  reeJization  of  efficient  loop- 
level  parallelism. 

This  concluding  chapter  is  divided  into  three  sections.  First,  we  summarize 
our  work,  pointing  out  it  advantages  and  disadvantages.  Next,  we  discuss  the 
current  effort  to  revise  the  1983  standard  of  the  language.  Lastly,  we  outline 
some  future  areas  of  study. 
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7.1      Our  Work 

The  goal  of  this  work  is  to  provide  efficient  loop-level  parallelism  in  Ada  without 
modifying  the  language.  In  Chapter  2  we  discussed  three  significant  shortcom- 
ings that  must  be  overcome  before  this  goaJ  can  be  achieved. 

In  Chapter  3  we  presented  an  enhanced  version  of  a  previous  solution  to 
the  first  of  these  deficiencies,  the  bottleneck  that  results  when  one  wishes  to 
distribute  distinct  identities  to  a  group  of  tasks.  The  solution  presented,  the 
gen -beacon  generic  package,  is  more  efficient  in  both  time  and  space  then  its 
predecessor. 

We  addressed  the  latter  two  deficiencies,  the  synchronization  point  required 
between  task  creation  and  tcisk  activation  and  the  storage  overhead  reqmred 
to  manage  an  Ada  task,  in  the  remaining  chapters.  In  Chapter  4  we  explored 
possible  solutions  to  these  deficiencies  and  presented,  the  gen.minitask  generic 
package.  In  Chapter  5  we  described  the  implementation  of  this  idiom  in  detail 
and  explored  other  implementation  issues.  In  Chapter  6  we  showed  that  the 
overhead  and  efficiency  of  our  idiom  outperforms  the  standard  means  of  obtain- 
ing a  paradlel  loop  in  Ada.  Moreover,  we  showed  that  it  compares  favorably  with 
the  parallel  loop  construct  of  IBM  Parallel  Fortran. 

In  this  section  we  address  other  issues  concerning  our  work. 

Combining  Network  Dependence 

The  analysis  of  our  implementation  assumes  that  multiple  fetch. and  jdd  oper- 
ations to  the  same  memory  location  are  combined  in  the  network.    This  fact 
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might  lead  one  to  expect  our  work  to  be  only  applicable  to  machines  that  im- 
plement the  fetch_and_add  primitive  in  this  manner.  While  it  is  true  that  a 
combining  network  removes  the  serial  bottleneck  normally  associated  with  task 
initialization,  its  absence  does  not  preclude  an  efficient  implementation. 

In  Chapter  6  we  saw  that  the  time  to  initialize  one  minitask  compared  fa- 
vorably to  both  the  time  to  initialize  an  iteration  of  a  Parallel  Fortran  loop  and 
a  regular  Ada  task.  This  fact  is  independent  of  whether  multiple  initializations 
can  be  performed  in  parallel  (as  is  the  case  with  our  work  with  the  presence  of 
a  combining  network)  or  serially,  (as  done  in  the  other  two  methods).  For  this 
reason  we  conclude  that  our  work  can  provide  efficient  loop-level  paraJlehsm 
even  when  the  fetch_and_add  primitive  is  not  implemented  using  a  combining 
network. 

When  to  use  the  Gen_Minitask  Idiom 

In  Chapter  6  we  define  the  lower  bound  granularity  for  a  parallel  loop  construct 
and  compute  this  value  for  the  gen-minitask  idiom.  We  show  that  a  particular 
threshold  exists  where  use  of  our  idiom  is  expected  to  out  perform  a  sequential 
execution.  This  threshold  is  measured  in  terms  of  the  average  size  of  each 
iteration  and  is  a  function  of  the  number  of  iterations.  We  saw  that  although 
our  idiom  has  a  lower  threshold  than  other  paraJlel  loop  idioms,  it  should  not 
be  used  when  this  threshold  is  not  surpassed.  As  is  the  case  with  most  parallel 
loop  idioms  used  today,  a  potential  parallel  loop  that  has  a  small  body,  or  a 
small  number  iterations  should  be  executed  serially. 

It  remains  to  be  seen  whether  the  need  to  consider  this  threshold  will  exist  in 
the  future.  If  developments  in  hardware  and  software  can  be  found  that  reduce 
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the  overhead  associated  with  initiahzation.  issues  of  granularity  may  no  longer 
need  to  be  considered;  all  semantically  pau-allel  loops  will  be  executed  in  parallel. 

A  Limitation  of  the  Gen.Beacon  Idiom 

The  gen.beacon  idiom  provides  the  programmer  with  a  monitored  fetch_and  jdd 
shared  \-ariable.  Since  this  idiom  takes  the  form  of  a  package  instantiation  its 
declaration  appears  in  a  declarative  part,  as  is  the  case  with  other  x-ariable 
declarations.  We  provide  access  to  this  variable  through  several  subprograms. 
As  Ada  does  not  permit  subprogram  parameters,  we  are  unable  to  pass  access 
fetch. and_add  variables  to  subprograms. 

To  fullfil  this  need,  the  designers  of  the  language  suggest  the  using  generic 
subprograms.  In  Chapter  3  we  illustrate  this  hmitation  with  an  example  and 
show  how  it  can  be  overcome  with  the  use  of  generics. 

Furthermore,  recall  that  our  goal  in  constructing  the  gen.beacon  idiom  was  to 
provide  fetch.and^dd  \-ajiables  for  our  pao-allel  loop  idiom.  As  the  gen.minitask 
idiom  does  not  need  to  pass  fetch _and_add  \-ariables  as  paraimeters.  the  gen- 
.beacon idiom  reailizes  this  goed;  it  provides  zin  efficient  mechanism  for  obtaining 
fetch.and.add  vziriables  within  Ada. 

Readability  of  the  Gen.Minitask  Idiom 

We  have  constructed  the  gen.minitask  idiom  to  play  the  role  of  a  parallel  loop 
in  Ada  In  doing  so  we  have  substituted  a  declcirative  item  (a  package  instanti- 
ation) for  a  control  statement  (a  loop).  Despite  this  fact,  we  have  not  sacrificed 
any  expressive  power.  However,  the  resulting  program  can  be  difficult  to  read. 
A  possible  solution  to  this  problem  is  to  utibze  the  scheme  suggested  by  Dritz 
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(see  Chapter  2)  by  modifying  our  idiom  so  that  an  instantiation  would  provide 
subprograms  that  can  later  be  called  to  obtain  a  parallel  loop.  While  this  would 
place  the  parallel  loop  idiom  in  the  sequence  of  statements  section  of  the  pro- 
gram, it  would  not  remove  the  necessity  of  specifying  the  loop  body  and  iteration 
range  as  parameters. 

However,  whatever  we  lose  in  "ease  of  programming",  we  gain  in  rehability. 
Due  to  the  fact  that  our  idiom  is  a  package  instantiation,  we  Ccin  associate  more 
information  with  it  then  can  be  done  with  a  parallel  loop  control  structure.  In 
particular,  we  can  collect  status  information  about  each  thread  in  the  loop  and 
provide  functions  that  can  be  used  to  obtain  this  information,  something  that 
we  have  not  seen  in  any  other  parallel  loop  construct.  Furthermore,  both  of 
these  features  would  not  be  present  if  the  Dritz  scheme  were  used. 

7.2      Ada  9X 

To  maintain  portabihty,  we  have  placed  an  important  restriction  on  our  work: 
all  idioms  we  supply  must  be  written  in  standard  Ada.  During  the  development 
of  this  work,  a  revision  of  the  1983  standard  has  begun.  The  eventual  outcome 
of  this  process  will  be  a  new  standard  czdled  Ada  9X,  (where  X  is  a  digit  in 
the  range  0-9).  This  section  focuses  on  this  revision  process  with  particular 
emphasis  on  the  problems  we  have  discussed. 

In  October  1988,  the  Ada  9X  project  was  initiated  with  an  invitation  to 
the  public  to  submit  revision  requests.  This  period  concluded  in  October  1989 
and  produced  over  750  requests  for  language  changes.    Out  of  these  requests 
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the  Ada  9X  requirements  were  defined  and  made  public  [Ada90].  These  re- 
qxiirements  were  given  to  the  Mapping/Revision  Team  whose  goal  is  to  design 
specific  changes  to  the  Ada-83  standard  that  satisfy  these  requirements.  As 
these  changes  are  made,  the  Implementation/Analysis  Team  investigates  the 
ramifications  of  their  implementation.^ 

Although  many  of  the  Ada  9X  requirements  are  worthy  of  study,  we  refer 
the  interested  reader  to  [AdaQO]  for  a  full  description.  Instead,  we  point  out  the 
requirements  that  are  relevant  to  this  work. 

Recall  from  Chapter  2  the  three  shortcomings  in  the  Ada  tasking  model  when 
trying  to  achieve  loop-level  parallefism: 

1.  the  lack  of  the  ability  to  distribute  identities  to  tasks  in  parallel, 

2.  the  synchronization  point  that  is  required  during  task  initiation,  and 

3.  the  storage  overhead  required  to  manage  a  tzisk's  status. 

The  next  two  sections  specify  the  two  Ada  9X  requirements  that  most  closely 
address  these  deficiencies. 

7.2.1      Managing  Large  Number  of  Tasks 

Section  7.2  of  the  requirements  document  [AdaQO]  specifies: 

"Study  Topic  S7.2-A(l):  Ada  9X  must  provide  for  the  eflicient 
creation,  initialization,  execution,  and  termination  of  large  number 
of  tasks. 


'The  NYUAda  project  b&s  been  chosen  as  the  Implementation/Analysis  team  for  Ada  9X. 
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Discussion:  A  critical  requirement  for  efficiency  of  large  scale  par- 
allel applications  is  that  there  be  no  "serial  bottlenecks,"  i.e.  points 
at  which  the  execution  time  depends  on  executing  serial  code  whose 
execution  time  is  dependent  on  the  number  of  processors. 

Ada  83  does  provide  many  of  the  needed  facilities  to  meet  this  re- 
quirement. Tasks  can  be  initiated  in  parallel,  and  terminated  in 
parallel  (using  the  terminate  alternative).  However,  there  is  no  easy 
way  to  give  tasks  an  identity  so  that  the  tasks  that  are  initiated  in 
parallel  can  work  on  separate  parts  of  a  problem  without  communi- 
cating with  some  master  controlling  task. 

There  are  a  number  of  possible  approaches  to  solving  this  problem. 
One  approach  is  to  provide  a  task  with  the  ability  to  directly  deter- 
mine its  own  identity  (for  example,  its  index  in  an  array  of  tasks). 
Another  approach  is  to  provide  some  kind  of  mechanism  for  param- 
eterizing tasks." 

It  is  encouraging  to  see  that  this  requirement  addresses  the  first  shortcoming 
that  we  (and  others)  have  identified,  the  inability  of  a  task  to  determine  its  own 
identity.  It  appears  that  this  deficiency  was  an  oversight  in  the  original  design 
which  will  be  fixed  in  the  new  revision.  However,  the  same  cannot  be  said  for 
the  two  remaining  obstacles  to  efficient  loop-level  paradlelism.  Although  neither 
loop-level  parallelism  or  these  two  deficiences  are  addressed  by  the  requirements 
document,  the  next  section  does  discuss  related  issues. 
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7.2.2      Vector  Architectures 

This  requirement  discusses  statement  level  parallelism  in  vector  architectures. 
Although  the  topic  appears  to  be  out  of  the  context  of  our  work,  some  of  the 
issues  addressed  our  relevant. 

"Study  Topic  S7.3-A(l)  -  Statement  Level  Parallelism:  Ada 
9X  should  accommodate  compiler  techniques  for  efficiently  map- 
ping sequences  of  Ada  statements,  including  particularly  appropriate 
loops,  onto  vector  architectures. 

Discussion:  Although  Ada  permits  the  use  of  tasks  for  explicitly 
describing  parallelism,  this  is  much  too  heavy  a  mechanism  for  loop 
level  parallelism.  In  other  languages,  approaches  to  this  problem 
have  involved  automatic  recognition  of  loop  parallelism  by  compilers 
and  the  introduction  of  specialized  loop  constructs  explicitly  speci- 
fying parallel  execution. 

Compiler  techniques  for  mapping  sequential  code  onto  vector  archi- 
tectures are  well  known.  However,  there  is  some  difficulty  in  applying 
these  techniques  to  Ada  because  of  exception  semantics.  The  relcix- 
ations  permitted  by  section  11.6  of  the  Ada  stcindard  are  insufficient 
to  permit  vectorization  to  the  desired  degree.  For  example,  a  simple 
loop  that  adds  the  elements  of  two  vectors  together  cannot  be  vec- 
torized since  Ada  requires  that  an  exception  occurring  in  the  first 
[iteration  of  the]  loop  prevent  further  iteration[s]  of  the  loop." 
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Although  this  requirement  mentions  the  desirability  of  efficient  loop-level 
parcdlelism  in  Ada,  it  appears  to  rule  out  the  addition  of  a  language  construct 
that  would  allow  the  programmer  to  specify  this  form  of  paraJlelism.  Instead 
it  favors  a  compiler  detection  approach.  While  this  continues  to  be  an  area  of 
research,  it  seems  odd  that  the  requirements  team  would  rely  on  the  presence 
of  an  optimizing  compiler  to  provide  this  desirable  feature,  instead  of  allowing 
the  programmer  to  specify  this  type  of  parallelism  directly. 

In  addition,  it  appears  that  an  accommodation  for  the  parallelization  of  loops 
is  not  included  in  this  requirement.  Although  many  of  the  compiler  techniques 
used  for  vectorization  can  be  used  in  parallelization,  the  semantics  of  a  vector 
loop  differ  from  those  of  a  parallel  loop.  In  a  vector  loop  an  ordering  of  iterations 
is  defined;  a  loop  may  be  vectorized  even  though  certain  dependences  (anti-) 
exist  among  iterations.  However,  a  parallel  loop  contains  no  implicit  ordering; 
no  dependences  may  exist  among  iterations. 

Furthermore,  the  size  of  vector  loops  are  ultimately  limited  by  the  function- 
ality of  the  underlying  machine;  each  loop  is  constrained  to  contziin  only  one 
vector  assignment.  On  the  contrary,  the  size  of  a  parallel  loop  may  be  arbitrar- 
ily large.  Therefore,  if  the  compiler  detection  approach  is  used  in  Ada  9X  it 
appears  that  a  specific  reqmrement  to  accommodate  Ada  9X  to  paralleHzation 
would  be  needed. 

Both  of  the  requirements  we  have  cited  address  issues  that  are  important  in 
obtaining  loop-level  paraUehsm  in  Ada.  As  these  are  requirements,  it  is  difficult 
to  determine  what  language  revisions  will  eventually  result.  However,  these  re- 
quirements fail  to  mandate  the  resolution  of  two  of  the  three  deficiencies  that  we 
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have  cited.    Therefore,  it  unfortunately  appears  that  the  definition  and  imple- 
mentation of  these  requirements  will  not  provide  efficient  loop-level  parallelism. 

7.3     Future  Work 

We  now  focus  on  possible  extensions  of  our  work. 

•  This  work  has  provided  efficient  solutions  to  several  significant  shortcom- 
ings in  the  Ada  tasking  model.  We  have  obtained  these  solutions  by  spec- 
ifying Ada  idioms  that  take  the  form  of  a  generic  package.  This  technique 
has  two  major  advantages: 

—  Since  the  specification  is  written  solely  in  Ada,  portabihty  is  main- 
tained across  all  machines. 

—  Due  to  the  restricted  accessibility  of  the  contents  of  the  package, 
speciaJ  optimizations  can  be  performed  that  normzdly  would  not  be 
allowed  if  full  programmer  accessibility  were  present. 

Since  this  technique  is  not  specific  to  the  problems  we  have  addressed,  it 
may  be  used  to  develop  other  idioms  (tasking  and  non-tasking)  that  an 
implementor  may  feel  are  desirable.  This  includes,  but  is  not  limited  to, 
any  future  shortcomings  that  may  arise  out  of  the  Ada  9X  standard. 

•  Our  work  has  focused  on  providing  efficient  implementations  of  our  idioms 
on  a  shared  memory  multiprocessor.  A  logical  extension  of  this  work 
would  be  derive  efficient  implementations  of  these  idioms  for  alternative 
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architectures,  such  as  distributed  memory  machines,  vector  machines,  or 
even  sequential  machines. 

•  Another  possible  area  of  interest  would  be  to  attempt  to  find  additional 
parallelism  by  transforming  sequential  portions  of  an  Ada  program  to  the 
idioms  we  have  suggested.  These  idioms  could  then  be  executed  efficiently 
under  the  proper  implementation.  As  mentioned  in  Section  7.2.2,  the 
exception  semantics  of  Ada  make  it  more  challenging  to  accomplish  this 
goal.  Although  there  is  discussion  that  these  semantics  will  be  relaxed  in 
Ada  9X  to  allow  vectorization,  a  further  relaxation  would  be  required  for 
parallelization  to  occur. 

In  addition,  the  presence  of  both  multiple  threads  of  execution  and  shared 
variables  can  lead  to  dependences  that  are  not  present  when  sequential 
programs  are  parallelized  [SS88].  Parallehzation  of  Ada  programs  that 
contaiin  these  two  features  must  ensure  that  these  dependences  are  upheld. 
Despite  these  challenges,  we  feel  that  this  area  merits  attention. 

•  TraditioncJly,  the  control  flow  graph  has  provided  a  useful  representation 
for  performing  data  flow  analysis  on  sequential  programs  [ASU86].  More 
recently  this  graph  has  been  used  to  construct  the  control  dependence 
graph  [FOW87]  and  to  convert  a  program  into  static  single  assignment 
form  [CFR'''89].  Both  of  these  representations  can  be  used  to  provide 
more  efficient  algorithms  for  various  program  optimizations. 

As  the  control  flow  graph  is  central  to  all  of  this  work,  obtaining  an  equiv- 
alent representation  for  parallel  programs  (in  particular,  Ada)  would  be 
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beneficial.  Once  this  representation  is  obtained,  care  must  be  taken  to  en- 
sure that  the  algorithms  used  to  obtain  the  control  dependence  graph  and 
static  single  assignment  form  are  still  applicable.  Work  on  this  problem  as 
it  applies  to  a  parallel  version  of  FORTRAN  is  currently  being  done  by  Wolfe 
and  his  colleagues  [Wol91].  Since  Ada  tasking  semantics  provide  a  more 
powerful  mechanism  than  those  of  the  FORTRAN  being  studied,  devising  a 
representation  for  Ada  should  prove  to  be  more  chaJlenging. 


Appendix  A 


Examples 


In  this  chapter  we  illustrate  the  expressive  power  of  the  gen.minitask  idiom  by 
presenting  three  examples.  Each  of  these  examples  utihzes  loop-level  parcJlelism, 
making  them  suitable  candidates  for  our  idiom.  The  examples  we  present  are: 

calculate_pi:  A  program  that  computes  tt  by  integrating  ^r^  between  0  and 
1. 

gaussgordan:  A  program  that  performs  Gauss-Jordan  elimination  on  a  linear 
system  of  equations. 

niatrix_mult:   A  package  that  contains  a  matrix  multiplication  function. 

A.l      CalculateJPi 

This  program  computes  tt  by  integrating  ^^  over  the  intervzd  from  0  to  1 
[Bab88].  The  computation  is  based  on  the  vaJue  of  two  variables  given  by  the 
user  and  is  given  on  page  185.    The  first  variable,  numJntervals,  specifies  the 

182 
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precision  of  the  computation;  it  defines  the  number  of  intervals  between  0  and 
1.  The  second  variable,  numJasks,  specifies  the  amount  of  parallelism  to  be 
used  in  the  computation.  As  the  number  of  instructions  required  to  compute  an 
intervaJ  is  small,  it  is  desirable  to  have  numJntervals  larger  than  num.tasks. 

The  intervals  are  assigned  in  the  following  manner:  each  minitask  receives 
one  of  the  first  num.tasks  intervals.  After  computing  the  area  for  its  first  in- 
terval, a  minitask  acquires  its  next  intervaJ  by  adding  num.tasks  to  its  initial 
interval.  This  process  is  repeated  until  numJntervals  is  surpassed.  For  example, 
the  first  minitask  computes  intervals  1,  numJasks  -f  1,  2-  num.tasks  -f  1,  ..., 
: — r •  num.tasks  +  1. 

Current.interval  takes  on  values  spanning  from  1  to  numJntervals.  As  it 
should  be  representative  of  the  range  of  values  from  current.interval  -  1  to  cur- 
rent Jnterval,  0.5  is  subtracted  to  obtain  an  average.  To  determine  the  value  of 
X,  currentJnterval  -  0.5  is  scaled  down  according  to  the  number  of  intervzJs,  i.e. 
it  is  divided  by  numJntervals.  This  value  is  substituted  in  y^j^^  to  determine  the 
y  value  for  this  interval.  This  value  is  used  to  obtain  the  area  represented  by 
this  interval;  it  is  the  product  of  the  y  value  and  the  interval.width.  Lastly  this 
product  is  added  to  the  Iccal^um  for  this  minitask. 

As  the  computation  for  this  interval  is  complete,  an  attempt  is  made  to  com- 
pute the  minitask's  next  interval;  currentJnterval  is  incremented  by  num.tasks. 
A  check  is  performed  to  see  if  this  new  value  represents  a  valid  interval.  If  it 
does,  this  process  continues.  However,  if  there  are  no  more  intervals  for  this  task 
to  compute,  thf  local  sum  value  is  stored  into  the  shared  array  sum.array  in  the 
appropriate  index. 

When  the  main  program  regains  control  after  the  begin  statement  following 
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the  package  instantiation,  a  check  is  performed  on  the  family  of  minitasks.  If  the 
value  true  is  returned,  the  loccil  sums  of  each  minitask  are  totaled  and  reported. 
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with  text_io;  use  text_io; 
with  gen.minitask; 
procedure  calculate. pi  is 

pi  :  float; The  final  result 

num. tasks,  numJntervals  :  integer; Input 

sum.array  :  array  (num_tasks_type)  of  float; 

package  int_io  is  new  integer_io(integer);  use  intJo; 

package  my.float  is  new  floatjo(float);  use  my  .float; 
begin  —  Acquire  number  of  threads  and  intervals. 

put("Enter  number  of  tasks:  ");  get(num. tasks); 

put("  Enter  number  of  intervals:  ");  get(num_intervals);  new/. line; 

declare  —  Call  subroutine  concurrently  to  do  work, 
subtype  iteration. range  is  integer  range  l.num.tasks; 
procedure  iteration_body(iterate  :  iteration.range)  is 
local.sum  :  float  :=  0.0; 

current. interval  :  integer; Current  interval 

X  :  float; Value  of  a-  used  for  integration 

begin 

current. interval  :=  iterate; 

while  current. interval  <=  num. intervals  loop 

X  :=  (fioat(current. interval)  -  0.5)  /  float(num. intervals) 
local.sum  :=  jocajjum  +  interval. width  *  4  0  /  (1.0  +  x  *  x); 
current.interval  :=  current.interval  +  num.tasks; 
end  loop; 

$um.array(iterate)  :=  local.^um,  —  Deposit  local  sum. 
end  iteration. body; 

package  procnum.minitask  ia  new 

gen . mini  task(it  era  tion.  range, iteration. body); 
begin 
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if  procnum_minitask. success  then 
for  i  in  num_tasks_type  loop 

pi  :=  pi  +  sum-array(i); 
end  loop; 
else 

raise  minltask.exception;  —  Minitask  error  condition, 
end  if; 
end; 
end  calculate.pi; 

A. 2      Gauss_Jordan 

This  program  solves  a  linear  system  AX  =  B,  where  A  is  an  A'^  by  N  matrix, 
X  is  an  A^  by  M  matrix,  and  5  is  an  A^  by  M  matrix.  The  method  used  is 
Gauss- Jordan  elimination  as  described  by  Heller  [Hel78]. 

Two  versions  of  the  aJgorithm  are  supplied.  The  first  performs  Gauss-Jordan 
elimination  by  executing  each  of  the  parallel  loops  sequentially.  The  second  pro- 
cedure utilizes  the  gen_minitask  idiom  to  express  the  inherent  loop-level  paral- 
lelism of  the  algorithm. 

Both  procedures  are  passed  two  matrices,  A  and  X,  as  parameters.  For 
convenience,  matrices  A  and  B  are  passed  as  one  parameter,  AB,  (5  extended 
to  the  right  of  A).  Upon  conclusion  of  the  algorithm,  out  parameter  matrix  X 
contains  the  solution.  The  bounds  of  these  parameters  are  obtained  by  using 
the  predefined  attribute  operations,  'RANGE  and  'LAST.  In  computing  X, 
the  algorithm  overwrites  matrix  AB. 

One  optimization  is  performed  in  the  gen_minitask  version  in  the  second  set 
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of  nested  parallel  loops.  As  no  operations  are  performed  inside  of  the  outer 
loop  and  outside  the  inner  loop,  these  loops  are  coalesced  [Pol87].  This  transfor- 
mations eliminates  the  spawning  of  the  N  minitask  families  without  decreasing 
parallelism. 

procedure  seriaLgaus5.jordan(AB:  in  out  matrix;  X:  out  matrix)  is 

mult  :  float; 
begin 

for  J  in  AB'range(I)  loop 

for  i  in  AB'range(I)  loop  —  Parallel  Loop 
if  i  /=  j  then 

mult  :=  AB(i,j)/AB(j,j); 

for  k  in  j  +  1  ..  AB'LAST(2)  loop  —  Parallel  Loop 

AB(i,  k)  :=  AB(i.  k)  -  mult  •  AB(j,  k); 
end  loop; 
een  if; 
end  loop; 
end  loop; 

for  i  in  X'range(I)  loop  —  Parallel  Loop 
for  j  in  X'range(2)  loop  —  Partdlel  Loop 
X(i,  j)  :=  AB(i,  j  +  AB'LAST(l))/AB(i,  i); 
end  loop; 
end  loop; 
end  terial.gauss.jordan; 
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procedure  paralleLgauss.jordan(AB:  in  out  matrix;  X:  out  matrix)  is 

mult  :  float; 
begixi 

for  j  in  AB'RANGE(I)  loop  --  Sequential  loop 
declare 

subtype  i. range  is  integer  range  AB'range(I); 

procedure  i_body(i  :  i.range)  is Body  of  i  loop  (par) 

begin 

if  i  =  j  then 

return; 
end  if; 

mult  :=  AB(i,j)/AB(j.j); 
declare 

subtype  k. range  is  integer  range  AB'range(2); 

procedure  k-body(k  :   k_range)  is Body  of  k  loop  (par) 

begin 

AB(i.  k)  :=  AB(i,  k)  -  mult  *  AB(j,  k); 
end  k.body; 

package  k.minitask  is  new  gen_minitask(k. range,  k.body); 
begin 

if  not  ij.minitask. success  then 

raise  minitask.error; 
end  if; 
end; 
end  i-body; 

package  i.minitask  is  new  gen_minitask(i_range,  i_body); 
begin 

if  not  ij.minitask. success  then 
raise  minitask.error; 
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end  if: 
end 
end  loop 

declare  —  Second  nest  of  loops. 

subtype  ij-range  is  integer  range  1       X'lENGTH(I)  *  X'LENGTH(2); 
procedure  ij-body(ij  :  ijjange)  is  —  i  A:  j  loops  combined 

i,  j     integer; 
begin 

i  ;=  (ij-1)  div  Xlength(2)  +  XfirST(I); Compute  appropriate  value  of  i 

j  :=  (ij-1)  mod  Xlength(2)  +  X'first(2),  —  Compute  appropriate  value  of  j. 
X(i.  j)  =  AB(i.  j  +  ABlast(I)  /  AB(i,  i): 
end  ij.body. 

package  Ijjninitask  is  new  gen.minitask(ijj-ange.  ij^ody); 
begin 

if  not  ij.minilask  success  then 

raise  minitask.error, 
end  if 
end. 
end  parallel-gauss.jordan; 


A. 3      Matrix  Multiplication 

Our  final  example  is  a  package  version  of  matrix  multiplication.  This  package 
contains  two  visible  entities:  an  unconstrained  type  declaration  (matrix)  euid  a 
function  (matmult)  that  multiplies  two  matrices  and  returns  the  product. 

Thefunrtion  matmult  takes  the  two  matrices,  A  and  B  a5  parzimeters  and  once 
again  uses  the  predefined  array  attributes  to  determine  the  size  of  these  matrices. 
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In  a  standard  paraJlelization  of  matrix  multiplication  the  outer  two  loops  are 
executed  in  parallel.  Although  this  can  be  achieved  by  nesting  gen_minitask 
instantiations,  we  coalesce  these  two  parallel  loops,  we  choose  not  to  nest  uses 
of  our  idiom  for  the  usual  performance  reasons;  by  combining  the  two  outer  loops 
into  one  loop  we  reduce  the  number  of  forks  (i.e.  genjninitask  instantiations) 
from  a'LENGTH  +1  to  1.  This  user-specified  optimization  is  typical  when  dealing 
with  nested  parallel  loops. 

The  body  of  the  parallel  loop  is  represented  by  the  procedure  ij.body.  The 
parameter  of  the  procedure  plays  the  role  of  the  iteration  variable.  After  val- 
ues for  the  i  and  j  variables  have  been  computed,  we  perform  the  standard 
computation  for  the  innermost  loop  of  matrix  multiplication. 


APPENDIX  A.    EXAMPLES  191 

■with  gen.minitask; 
package  matmult.pack  is 

type  matrix  is  array(integer  range  <>,  integer  range  <>)  of  float; 

function  matmult{A,  B  :  matrix)  return  matrix; 
end  matmult.pack; 

package  body  matmult.pack  is 

function  matmult(A,  B  :  matrix)  return  matrix  is 
C  :  matrix(A'RANGE(l),  B'range(2)); 

subtype  ij.range  is  integer 

range  1  ..  A'lengTH(I)  •  B'lENGTh(2); 

procedure  ij.body(ij  :  ij.range)  is 

i,  j  :  integer; 
begin 

i  :=  (ij-1)  div  B'lengTH(2)  +  A'first(I); 

j  :=  (ij-1)  mod  B'length(2)  +  B'first(2); 

C(.,j):=0.0, 

—  For  each  of  B's  rows 

for  k  in  B'range(I)  loop 

C(i.j):=C(,,j)  +  A(,,  k)*  B(k,j); 

end  loop; 
end  ij.body; 
package  ij.minitask  is  new  gen.minita$k(ij.range,  tjJ)ody); 

begin 

if  not  ijminitask. success  then 
for  I  in  Ij.range  loop 

if  not  task.completion(i)  then 

—  Wc  can  use  tiisk.exception(i)  to 
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query  which  exception  was  raised. 

end  if; 
end  loop; 

We  infoim  the  caller  about  the  exception(s) 

by  raising  a  user-defined  exception. 

raise  minitask.exception; 
end  if; 
return  C; 
end  matmult; 
end  matmult. pack; 
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